<a href="https://colab.research.google.com/github/GorataB/Netflix-Data-Cleaning/blob/main/Netflix_Data_Cleaning_DSP_Gorata_Malose.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<h1 align="center">
    NSDC Data Science Projects
</h1>

<h2 align="center">
    Project: Netflix Data Cleaning
</h2>

<h3 align="center">
    Name: Gorata Malose
</h3>


### **Please read before you begin your project**

**Instructions: Google Colab Notebooks:**

Google Colab is a free cloud service. It is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources. We will be using Google Colab for this project.

Certain parts of this project will be completed individually, while other parts are encouraged to be completed with the rest of your team. Each member of your team should work on their personal copy.

Once this project is completed, you will be prompted to share your file with the National Student Data Corps (NSDC) Project Leaders.

You can now start working on the project. :)

**Project Description:**

This project will introduce students to an array of skills as they strive to access and prepare data for further analysis, a process referred to as data cleaning. Whenever data scientists work with any dataset, they must complete this process first to ensure the data is in a suitable format. In this project, students will be able to learn the process and apply it to a Netflix dataset. You should be able to apply this same process to all future datasets you would like to use for data science analysis.

[Use this link to join the NSDC DSP Slack Channel!](https://bit.ly/nsdc-dsp-movie-reviews)


---
---



<h3 align = "center">
    Milestone #1
</h3>

NOTE: These steps are to be completed **individually**, not as a team. You are encouraged to discuss steps with your teammates. Please attend Office Hours or ask your questions on Slack.

GOAL: The main goal of this milestone is to set up your environment, install the required packages, learn how to access data, and do some basic exploratory data analysis.

**Step 1:**

Setting up libraries and installing packages

To import a library:
```python
import <library> as <shortname>
```
We use a *short name* because it makes it easier to call the package's functions and to refer to subpackages within the library.


In [None]:
import pandas as pd
import numpy as np

These are the libraries that will help us throughout this project. Here are the links to the documentation for [Pandas](https://pandas.pydata.org/docs/) and [Numpy](https://numpy.org/doc/), which you can reference if you need help throughout the project.

We encourage you to read more about the important and most commonly used packages like Pandas and write a few lines in your own words about what they do. [You may use the Data Science Resource Repository (DSRR) to find resources to get started!](https://nebigdatahub.org/nsdc/data-science-resource-repository/)



<h4 style="color:orange">
    TO-DO
</h4>

Write a few lines about what each library does.

- **Pandas:** It loads, cleans, transforms, visualises, and analyzes data.

- **NumPy:** It creates arrays and uses them to perform mathematical operations.
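
As a small illustration of the two descriptions above, here is a minimal sketch using made-up values (the titles and years are hypothetical, not taken from the Netflix dataset):

```python
import numpy as np
import pandas as pd

# NumPy: create an array and apply a mathematical operation to it
years = np.array([2019, 2020, 2021])
mean_year = years.mean()

# pandas: hold labelled, tabular data and filter it
toy_df = pd.DataFrame({"title": ["A", "B", "C"], "release_year": years})
recent = toy_df[toy_df["release_year"] >= 2020]

print(mean_year)
print(recent)
```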


**Step 2:**

Let’s access our data. We will be using the Netflix Dataset from Kaggle. The dataset contains Netflix media and respective information about each of those movies and TV shows.


[The dataset is available at this link](https://www.kaggle.com/datasets/shivamb/netflix-shows). It is better to use the link provided directly within the read_csv function.

To use this dataset, you will have to download it to your computer, unzip it, and upload it to the 'Files' tab of Google Colab, which can be found on the left banner of the page. To access the Files tab, you must connect to a Runtime first. If you are unsure how to do this, you can refer to this [YouTube video](https://www.youtube.com/watch?v=6HFlwqK3oeo) that will walk you through the steps.


We will use pandas to read the data from the csv file using the `read_csv` function. This function returns a pandas dataframe. We will store this dataframe in a variable called `df`.

In [None]:
# TODO: Read the data using pandas read_csv function
df = pd.read_csv('netflix_titles.csv')
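
To see what `read_csv` returns without needing the uploaded file, here is a small sketch that reads a hypothetical two-row CSV from memory. The CSV text and the name `sample_df` are assumptions made for illustration only:

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the uploaded file
csv_text = "show_id,type,title\ns1,Movie,Example One\ns2,TV Show,Example Two\n"

# read_csv accepts a file path or a file-like object and returns a DataFrame
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df).__name__)
print(sample_df.shape)  # (number of rows, number of columns)
```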

**Step 3:**

Let's see what the data looks like. We can use the `head` function which returns the first 5 rows of the dataframe.

In [None]:
# TODO: Print the first 5 rows of the data using head function of pandas
df.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


There are 12 columns in the dataframe.
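
Rather than counting columns by eye, the count can be read from the dataframe itself. The sketch below uses a small made-up dataframe (`toy_df` is a hypothetical stand-in for `df`):

```python
import pandas as pd

# Hypothetical stand-in for the Netflix dataframe
toy_df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "TV Show", "TV Show"],
    "release_year": [2020, 2021, 2021],
})

# shape is a (rows, columns) tuple; columns holds the column labels
n_rows, n_cols = toy_df.shape
column_names = list(toy_df.columns)

print(n_rows, n_cols)
print(column_names)
```

Running the same two attributes on `df` should report the row and column counts of the full dataset.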

The `describe()` function gives us a summary of the data.

In [None]:
# TODO: Describe the data using describe function of pandas
df.describe()

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


Why does the `describe()` function only return a summary of 1 out of 12 of the columns?
In dataframes, there are different types of data that a column can store. Review those types on this [website](https://pbpython.com/pandas_dtypes.html).

Let's look at what data types each of these columns are storing using pandas' dtypes function.

In [None]:
# TODO: Observe the types of data in each column using the dtypes function of pandas
df.dtypes

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object

As you can see, there is only one numerical column, which is why the `describe()` function only returned information for one column. All of the other columns have the `object` dtype, which pandas uses for strings or mixed values.
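
If you also want a summary of the text columns, `describe()` accepts an `include` argument. The following is a minimal sketch on a made-up dataframe (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in with one text column and one numerical column
toy_df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "release_year": [2020, 2021, 2021],
})

# Default behaviour: numerical columns only
numeric_summary = toy_df.describe()

# include="all" adds count, unique, top, and freq for non-numerical columns
all_summary = toy_df.describe(include="all")

print(numeric_summary.columns.tolist())
print(all_summary)
```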

---
---



<h3 align = "center">
    Milestone #2
</h3>

NOTE: These steps are to be completed **individually**, not as a team. You are encouraged to discuss steps with your teammates. Please attend Office Hours or ask your questions on Slack.

GOAL: The main goal of this milestone is to clean this dataset, so it is in a format suitable for further data analysis.

**Step 1:**

The first step is to check if there are duplicate rows in the dataset and remove them. We can do that using the `duplicated()` function on the show_id column since that is a unique identifier. If two rows have the same show_id, then they are duplicates.

In [None]:
# Use the duplicated() function to flag rows whose show_id repeats an earlier row
df.duplicated(subset=['show_id'])

0       False
1       False
2       False
3       False
4       False
        ...  
8802    False
8803    False
8804    False
8805    False
8806    False
Length: 8807, dtype: bool

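`duplicated()` returns a Boolean Series with one entry per row, where `True` marks a row whose `show_id` repeats an earlier one. Every value displayed above is `False`, and calling `.sum()` on the result counts the `True` entries, so a total of 0 confirms that there are no duplicates in our dataset and we can move on to the next step.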
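
Because the printed output is truncated, it can help to count the duplicates explicitly. Here is a minimal sketch on a made-up dataframe that deliberately contains one repeated `show_id` (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data with one repeated show_id ("s2")
toy_df = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s3"],
    "title": ["A", "B", "B", "C"],
})

# Boolean mask: True for rows whose show_id already appeared earlier
dup_mask = toy_df.duplicated(subset=["show_id"])

# True counts as 1, so the sum is the number of duplicate rows
n_duplicates = int(dup_mask.sum())

# Show the duplicate rows, then drop them and renumber the index
duplicate_rows = toy_df[dup_mask]
deduped = toy_df.drop_duplicates(subset=["show_id"]).reset_index(drop=True)

print(n_duplicates)
print(deduped)
```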

**Step 2:**

The next step is to check if there are any null values in your dataset. To do this, we can use the `isnull()` and `sum()` functions to count how many null values are present in each column of the data.

In [None]:
# Use the isnull() and sum() functions to return count of null values for each column in dataframe
df.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

As you can see, there are 6 columns that have null values. We must address each of these null values before moving forward. There are multiple ways to deal with null values, and you must choose which method you will proceed with based on the number of null values and the necessity of that column's data for your future analysis.

If that column is integral to your future analysis or there are a lot of null values, then you will want to figure out how to fill the values because you will lose a lot of valuable data if you remove those rows entirely. However, if that column is not important to you, and there are only a few null values, then you can go ahead and remove those rows from the dataset.
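
One way to make this decision concrete is to look at the share of missing values per column. The sketch below uses made-up data, and the 30% cut-off is an arbitrary assumption for illustration, not a rule from this project:

```python
import numpy as np
import pandas as pd

# Hypothetical data: director is missing often, rating only occasionally
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [np.nan, np.nan, "X", "Y"],
    "rating": ["PG", "R", np.nan, "G"],
})

# The mean of a Boolean mask is the fraction of null values in each column
null_share = toy_df.isnull().mean()

# Assumed threshold: fill columns above it, drop rows for columns below it
threshold = 0.3
to_fill = null_share[null_share > threshold].index.tolist()
to_drop = null_share[(null_share > 0) & (null_share <= threshold)].index.tolist()

print(null_share)
print(to_fill, to_drop)
```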

Here is [an article](https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/) that discusses different ways to address null values. We are going to first drop rows that have null values for columns with a low number of nulls.

In [None]:
# Use the dropna() function and the 'subset' parameter to remove rows with nulls in the columns that have 10 or fewer null values
df = df.dropna(subset = ["date_added", "rating", "duration"])

# Use the reset_index() function to reset the dataframe's index
# (the old index is kept as a new 'index' column; pass drop=True to discard it)
df = df.reset_index()

# Check how many null values each column has again
df.isnull().sum()

index              0
show_id            0
type               0
title              0
director        2621
cast             825
country          829
date_added         0
release_year       0
rating             0
duration           0
listed_in          0
description        0
dtype: int64
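
The extra `index` column in the output above comes from `reset_index()`, which keeps the old index as a column by default. Here is a small sketch of the difference on made-up data (`toy_df` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing rating
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "rating": ["PG", np.nan, "R"],
})

# Drop only the rows where 'rating' is null
cleaned = toy_df.dropna(subset=["rating"])

# Default: the old index is preserved as a new 'index' column
kept_old_index = cleaned.reset_index()

# drop=True: the old index is discarded and rows are renumbered from 0
fresh_index = cleaned.reset_index(drop=True)

print(kept_old_index.columns.tolist())
print(fresh_index.columns.tolist())
```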

We will now address the columns with large numbers of null values: `director`, `cast`, and `country`. There are too many missing values for us to look up manually, and we would lose a lot of data if we deleted all of the corresponding rows. Because these columns hold text, we also cannot fill them with a numerical estimate such as a mean or median. Instead, we will create a new category for these rows.

We can use the NumPy library to identify the NaN (null) values and the `replace()` function to replace them with a placeholder string (here, `'this'`). Assign the result back to each column so the change is stored in the dataframe. Calling `replace()` with `inplace=True` on a selected column triggers a chained-assignment warning in recent pandas versions and is not guaranteed to modify the dataframe.

In [None]:
# Use the replace function to replace the NA values in each of these columns with a new identifying string value
df['director'] = df['director'].replace(np.nan, 'this')
df['cast'] = df['cast'].replace(np.nan, 'this')
df['country'] = df['country'].replace(np.nan, 'this')
df.isnull().sum()

index           0
show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64
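
As an alternative sketch of the same idea, `fillna()` can fill several columns in one call when given a dictionary. The data and the placeholder strings below are hypothetical; any clearly identifiable label would do:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing text values
toy_df = pd.DataFrame({
    "director": ["Kirsten Johnson", np.nan],
    "country": [np.nan, "India"],
})

# Map each column name to the placeholder that should replace its nulls
placeholders = {"director": "Unknown director", "country": "Unknown country"}

# fillna returns a new dataframe, so the result is assigned to a name
filled = toy_df.fillna(value=placeholders)

print(filled)
```

A descriptive placeholder makes these rows easy to spot or exclude during later analysis.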

**Step 3:**

For columns that aren't numerical but are likely to have a small number of distinct entries, there are some further checks. An example of this is the `type` column: there are only a few possibilities for what the type of media could be, but when analyzing the data, Python would treat "movie" and "Movie" as two different values. We therefore need to look at which values are currently present and make them all consistent.

First we will use the `unique()` function to list all the current unique values in the column.

In [None]:
# Use the unique() function to print unique values in the type column
print(df.type.unique())

['Movie' 'TV Show']


The inputs in this column all have the same capitalization, so we do not need to make any changes. We will also check similar columns, such as `rating`, to make sure no changes are needed.

In [None]:
# Use the unique() function to print unique values in the rating column
print(df.rating.unique())

['PG-13' 'TV-MA' 'PG' 'TV-14' 'TV-PG' 'TV-Y' 'TV-Y7' 'R' 'TV-G' 'G'
 'NC-17' 'NR' 'TV-Y7-FV' 'UR']
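
Our columns were already consistent, but if they were not, the values could be normalized as in the sketch below. The messy values and the `canonical` mapping are made up for illustration:

```python
import pandas as pd

# Hypothetical column with inconsistent capitalization and stray spaces
toy_df = pd.DataFrame({"type": ["Movie", "movie", " TV Show", "tv show"]})

# Before cleaning, every spelling counts as its own value
n_before = toy_df["type"].nunique()

# Strip whitespace, lowercase, then map each value to one canonical label
canonical = {"movie": "Movie", "tv show": "TV Show"}
toy_df["type"] = toy_df["type"].str.strip().str.lower().map(canonical)

unique_after = sorted(toy_df["type"].unique())
print(n_before, unique_after)
```

An explicit mapping is used here instead of `str.title()` because title-casing would turn "tv show" into "Tv Show".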


**Step 4:**

To make the columns easier to work with for analysis, it helps to change the type of some of the columns from an object to a type that is easier to analyze. For example, the date added will be much easier to use as a date variable, so let's change it!

First, display the contents of the column to see how the data is stored.

In [None]:
# Return the date added column
df['date_added']

0       September 25, 2021
1       September 24, 2021
2       September 24, 2021
3       September 24, 2021
4       September 24, 2021
               ...        
8785     November 20, 2019
8786          July 1, 2019
8787      November 1, 2019
8788      January 11, 2020
8789         March 2, 2019
Name: date_added, Length: 8790, dtype: object

Let's now convert these into date objects. You can convert the `date_added` column to a datetime object using the `to_datetime()` function.

In [None]:
# Convert the date_added column to a datetime column using to_datetime
df['date_added'] = pd.to_datetime(df['date_added'])
# Check the current types of the columns to see if it changed
df.dtypes

index                    int64
show_id                 object
type                    object
title                   object
director                object
cast                    object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                  object
duration                object
listed_in               object
description             object
dtype: object
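
Once a column is a datetime, its parts can be accessed through the `.dt` accessor. The sketch below uses made-up strings in the same "Month day, year" style. The explicit `format`, the `str.strip()` call, and `errors="coerce"` are defensive choices assumed here, not steps required by this project:

```python
import pandas as pd

# Hypothetical values: one clean, one with a leading space, one invalid
toy_dates = pd.Series(["September 25, 2021", " August 4, 2017", "not a date"])

# Strip stray whitespace, parse with an explicit format,
# and turn anything unparseable into NaT instead of raising an error
parsed = pd.to_datetime(toy_dates.str.strip(), format="%B %d, %Y", errors="coerce")

# The .dt accessor exposes date components such as the year
years = parsed.dt.year

print(parsed)
print(years)
```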

<h3 align = 'center' >
Your data is now ready for analysis! Thank you for completing the project!
</h3>

We will have future projects to walk you through how to analyze this cleaned data in Python and Tableau! Please check those out to continue to grow your data science skills!

Please do reach out to us if you have any questions or concerns. We are here to help you learn and grow.

If you have any queries, please contact the NSDC HQ Team at nsdc@nebigdatahub.org.
