## Netflix Dataset: Cleaning and Visualisation
_Author: Tshepo Ralehoko_ \
_Date: 19 July 2023_

#### Objective 

The purpose of this project is to use the __Netflix__ dataset, obtained from the [Kaggle](https://www.kaggle.com/datasets) website, to answer the questions listed below. Additionally, the project aims to visualise any interesting insights. The following insights shall be explored: 



#### Tech Stack

__Programming language: Python__ \
__Libraries:__ Pandas

### Code

#### Importing modules

In [3]:
# importing the requisite modules

import pandas as pd

#### Reading dataset

In [4]:
# loading the dataset

df=pd.read_csv(filepath_or_buffer=r'C:\Users\rtral\Datasets\netflix data\netflix_titles.csv')

#### Data Cleaning and Manipulation

In [5]:
# number of rows and columns

print('this dataset has {} rows and {} columns'.format(df.shape[0], df.shape[1]))

this dataset has 7787 rows and 12 columns


In [6]:
# removing the limit on column width so that long cell values are shown in full

pd.set_option('display.max_colwidth', None)

In [7]:
# inspecting the first two observations
# the columns are clearly visible in the printout below
# a missing value is noted in the director column

df.head(n=2)

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | TV Show | 3% | NaN | João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, Mel Fronckowiak, Sergio Mamberti, Zezé Motta, Celso Frateschi | Brazil | August 14, 2020 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV Sci-Fi & Fantasy | In a future where the elite inhabit an island paradise far from the crowded slums, you get one chance to join the 3% saved from squalor. |
| 1 | s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato | Mexico | December 23, 2016 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits Mexico City, trapped survivors from all walks of life wait to be rescued while trying desperately to stay alive. |


#### Columns guide:

1. __show_id:__ Show identifier
2. __type:__ TV show or movie.
3. __title:__ Title of the TV show or movie.
4. __director:__ Director of the TV show or movie.
5. __cast:__ Overall cast members.
6. __country:__ Country released in.
7. __date_added:__ Date added on __Netflix__.
8. __release_year:__ The release year.
9. __rating:__ Rating on __Netflix__.
10. __duration:__ Duration of TV show or movie.
11. __listed_in:__ A list of genres. 
12. __description:__ A brief description of TV show or movie.


In [8]:
print('the number and percentage of missing values by column \n')

def missing_values(df):
    
    # count of null values for each column
    null_count=df.isnull().sum().sort_values(ascending=False)
    
    # percentage of null values, rounded after scaling to avoid floating-point residue
    null_rate=(100*null_count/len(df)).round(decimals=2)
    
    # tabling the results
    miss_tbl=pd.concat([null_count, null_rate], axis=1)
    
    # renaming the columns of the above table
    col_renaming=miss_tbl.rename(
    columns = {0: 'null_count', 1: 'null_rate'})
    
    # returning the final table
    
    return col_renaming
      
# printing out the results

print(missing_values(df))

the number and percentage of missing values by column 

              null_count  null_rate
director            2389      30.68
cast                 718       9.22
country              507       6.51
date_added            10       0.13
rating                 7       0.09
show_id                0       0.00
type                   0       0.00
title                  0       0.00
release_year           0       0.00
duration               0       0.00
listed_in              0       0.00
description            0       0.00
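The same summary can be built more compactly, because `isna().mean()` returns the fraction of missing values per column directly. The block below is a minimal sketch on a small made-up frame; `toy` and `missing_summary` are hypothetical names used for illustration only and are not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are illustrative only)
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'director': [None, 'X', None, 'Y'],
    'country': ['ZA', None, 'US', 'MX'],
})

def missing_summary(frame):
    # isna().sum() counts nulls; isna().mean() is the null fraction per column
    summary = pd.DataFrame({
        'null_count': frame.isna().sum(),
        'null_rate': (frame.isna().mean() * 100).round(2),
    })
    # columns with the most missing values come first
    return summary.sort_values('null_count', ascending=False)

toy_summary = missing_summary(toy)
print(toy_summary)
```

Naming the columns when the table is created removes the need for the separate renaming step.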


In [9]:
# taking a look at the datatypes of the columns

print(df.dtypes)

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object
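The `date_added` column is stored as `object` even though it holds dates. A possible conversion is sketched below on a few made-up values in the "Month D, YYYY" shape seen in the `head` output. It assumes that some entries may carry stray whitespace and that unparseable entries should become `NaT`; neither assumption has been verified against the full dataset.

```python
import pandas as pd

# Illustrative values only; one has a leading space and one is missing
dates = pd.Series(['August 14, 2020', ' December 23, 2016', None], name='date_added')

# strip surrounding whitespace, then parse with an explicit format;
# errors='coerce' turns anything that does not match into NaT
parsed = pd.to_datetime(dates.str.strip(), format='%B %d, %Y', errors='coerce')

print(parsed)
```

Once the column is a datetime, components such as the year or month can be read through the `.dt` accessor.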


In [10]:
# duplicates

print('the number of duplicated rows:', df.duplicated().sum())

the number of duplicated rows: 0
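Since `show_id` is unique for every row, `df.duplicated()` can only flag rows that are identical in every column, identifier included. A complementary check, sketched here on made-up data, compares rows on a subset of content columns instead. The choice of `title` and `release_year` as the subset is an assumption made for illustration.

```python
import pandas as pd

# Toy frame: s1 and s2 differ only in their identifier (illustrative values)
toy = pd.DataFrame({
    'show_id': ['s1', 's2', 's3'],
    'title': ['A', 'A', 'B'],
    'release_year': [2020, 2020, 2019],
})

# duplicates across all columns, as in the cell above
exact_dupes = toy.duplicated().sum()

# duplicates when the identifier is ignored
content_dupes = toy.duplicated(subset=['title', 'release_year']).sum()

print('exact duplicates:', exact_dupes)
print('content duplicates:', content_dupes)
```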


In [11]:
# the minimum and maximum number of characters for each string variable

cols=[]
min_count=[]
max_count=[]

for col in df.columns:
    
    # consider only columns that are of string datatype
    if df[col].dtype == 'object':
         
        min_len=df[col].str.len().min()
        max_len=df[col].str.len().max()
        
        # the list of all columns that are of string datatype
        cols.append(col)
        
        # minimum and maximum string length for each column
        min_count.append(min_len)
        max_count.append(max_len)

df_chars=pd.DataFrame({'Column': cols, 'Min_count': min_count, 'Max_count': max_count})

df_chars[['Min_count','Max_count']]=df_chars[['Min_count','Max_count']].astype(int)


print('the minimum count and maximum count of the number of characters in each string variable \n')

df_chars

the minimum count and maximum count of the number of characters in each string variable 



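| | Column | Min_count | Max_count |
|---|---|---|---|
| 0 | show_id | 2 | 5 |
| 1 | type | 5 | 7 |
| 2 | title | 1 | 104 |
| 3 | director | 2 | 208 |
| 4 | cast | 3 | 771 |
| 5 | country | 4 | 123 |
| 6 | date_added | 11 | 19 |
| 7 | rating | 1 | 8 |
| 8 | duration | 5 | 10 |
| 9 | listed_in | 6 | 79 |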
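The loop above can also be written without explicit lists. The following is a sketch on a small made-up frame, with `toy`, `lengths` and `char_summary` as hypothetical names. It selects the non-numeric columns with `select_dtypes(exclude='number')`, which assumes that every non-numeric column holds strings.

```python
import pandas as pd

# Toy frame with string columns and one numeric column (illustrative values)
toy = pd.DataFrame({
    'show_id': ['s1', 's2', 's10'],
    'title': ['3%', '7:19', 'Zozo'],
    'director': [None, 'Jorge Michel Grau', 'Ann'],
    'release_year': [2020, 2016, 2019],
})

# character count of every value in the non-numeric columns
lengths = toy.select_dtypes(exclude='number').apply(lambda col: col.str.len())

# minimum and maximum length per column; missing values are skipped
char_summary = lengths.agg(['min', 'max']).T.astype(int)
char_summary.columns = ['Min_count', 'Max_count']

print(char_summary)
```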


In [12]:
# inspecting the titles, since the shortest title is a single character

df['title']

0                                            3%
1                                          7:19
2                                         23:59
3                                             9
4                                            21
                         ...                   
7782                                       Zozo
7783                                     Zubaan
7784                          Zulu Man in Japan
7785                      Zumbo's Just Desserts
7786    ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS
Name: title, Length: 7787, dtype: object

In [19]:
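# titles with fewer than four characters, together with their lengths

short_titles=df.loc[df['title'].str.len() < 4, ['title']].copy()
short_titles['title_len']=short_titles['title'].str.len()
short_titles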

#### Addressing Missing Values

In this section we shall address missing values.

In [None]:
df[['show_id', 'type', 'title']]=df[['show_id', 'type', 'title']].astype('str')
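The notebook does not show which strategy is used for the missing values. One common option, sketched below on made-up data, is to keep rows where a frequently missing text column is empty and mark the gap with a placeholder, while dropping the few rows where a rarely missing column is empty. The `'Unknown'` label and the choice of which columns to fill or drop are assumptions, not the author's method.

```python
import pandas as pd

# Toy frame with gaps in several columns (illustrative values only)
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'director': [None, 'X', None],
    'rating': ['TV-MA', None, 'PG'],
    'date_added': ['August 14, 2020', None, 'December 23, 2016'],
})

# frequently missing column: keep the rows and use a placeholder (hypothetical label)
toy_clean = toy.fillna({'director': 'Unknown'})

# rarely missing columns: drop the affected rows
toy_clean = toy_clean.dropna(subset=['rating', 'date_added'])

print(toy_clean)
```

Filling keeps the row count high for columns such as `director`, where roughly 30% of values are missing, whereas dropping costs very little for `date_added` and `rating`, where under 0.2% are missing.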
