# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before you start coding, provide the link to your dataset below.

My dataset:

Import the necessary libraries and create your dataframe(s).

In [4]:
# Import the appropriate libraries

import pandas as pd
import numpy as np
import seaborn as sns                       # visualisation
import matplotlib.pyplot as plt             # visualisation

%matplotlib inline

# Create dataframe from csv

mv_df1 = pd.read_csv("Netflix subscription fee Dec-2021.csv")

## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [5]:
mv_df1.head()

|   | Country_code | Country | Total Library Size | No. of TV Shows | No. of Movies | Cost Per Month - Basic ($) | Cost Per Month - Standard ($) | Cost Per Month - Premium ($) |
|---|---|---|---|---|---|---|---|---|
| 0 | ar | Argentina | 4760 | 3154 | 1606 | 3.74 | 6.3 | 9.26 |
| 1 | au | Australia | 6114 | 4050 | 2064 | 7.84 | 12.12 | 16.39 |
| 2 | at | Austria | 5640 | 3779 | 1861 | 9.03 | 14.67 | 20.32 |
| 3 | be | Belgium | 4990 | 3374 | 1616 | 10.16 | 15.24 | 20.32 |
| 4 | bo | Bolivia | 4991 | 3155 | 1836 | 7.99 | 10.99 | 13.99 |


In [6]:
mv_df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country_code                   65 non-null     object 
 1   Country                        65 non-null     object 
 2   Total Library Size             65 non-null     int64  
 3   No. of TV Shows                65 non-null     int64  
 4   No. of Movies                  65 non-null     int64  
 5   Cost Per Month - Basic ($)     65 non-null     float64
 6   Cost Per Month - Standard ($)  65 non-null     float64
 7   Cost Per Month - Premium ($)   65 non-null     float64
dtypes: float64(3), int64(3), object(2)
memory usage: 4.2+ KB


In [7]:
mv_df1.describe()

|   | Total Library Size | No. of TV Shows | No. of Movies | Cost Per Month - Basic ($) | Cost Per Month - Standard ($) | Cost Per Month - Premium ($) |
|---|---|---|---|---|---|---|
| count | 65.0 | 65.0 | 65.0 | 65.0 | 65.0 | 65.0 |
| mean | 5314.415385 | 3518.953846 | 1795.461538 | 8.368462 | 11.99 | 15.612923 |
| std | 980.322633 | 723.010556 | 327.279748 | 1.937819 | 2.863979 | 4.040672 |
| min | 2274.0 | 1675.0 | 373.0 | 1.97 | 3.0 | 4.02 |
| 25% | 4948.0 | 3154.0 | 1628.0 | 7.99 | 10.71 | 13.54 |
| 50% | 5195.0 | 3512.0 | 1841.0 | 8.99 | 11.49 | 14.45 |
| 75% | 5952.0 | 3832.0 | 1980.0 | 9.03 | 13.54 | 18.06 |
| max | 7325.0 | 5234.0 | 2387.0 | 12.88 | 20.46 | 26.96 |
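
The `info()` output reports 65 non-null entries in every column, so nothing has to be repaired in this dataset. As a hedged sketch of what the handling could look like if a value were missing, the example below takes a few rows from `head()` and blanks out one price on purpose. The `sample` frame and the median fill are illustrative assumptions, not steps this notebook performed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: rows from head(), with one Basic price set to NaN
# on purpose so the check has something to find.
basic = "Cost Per Month - Basic ($)"
sample = pd.DataFrame({
    "Country": ["Argentina", "Australia", "Austria", "Belgium", "Bolivia"],
    "Total Library Size": [4760, 6114, 5640, 4990, 4991],
    basic: [3.74, 7.84, np.nan, 10.16, 7.99],
})

# Count and percentage of missing values per column
missing_count = sample.isnull().sum()
missing_pct = sample.isnull().mean() * 100
print(pd.DataFrame({"missing": missing_count, "percent": missing_pct}))

# One possible repair for a numeric column: fill with the column median.
# Dropping the row with dropna() is the alternative when the row is not needed.
filled = sample.copy()
filled[basic] = filled[basic].fillna(filled[basic].median())
```

The median is less sensitive to extreme prices than the mean, which is why it is used here; whether filling or dropping is appropriate depends on how many rows are affected.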


## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [8]:
# Check for null values with isnull(); every column reports 0, so nothing is missing
mv_df1.isnull().sum()

Country_code                     0
Country                          0
Total Library Size               0
No. of TV Shows                  0
No. of Movies                    0
Cost Per Month - Basic ($)       0
Cost Per Month - Standard ($)    0
Cost Per Month - Premium ($)     0
dtype: int64
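
The cell above only counts null values, so outliers still need a separate check. One common approach, shown here as a sketch rather than the method this notebook used, is the 1.5 × IQR rule: a value is flagged when it lies more than 1.5 interquartile ranges below the first quartile or above the third. The helper name `iqr_outliers` and the five-row `sample` (taken from `head()`) are assumptions for illustration. For reference, the `describe()` output gives quartiles of 7.99 and 9.03 for the Basic price, so the fences on the full dataset would be about 6.43 and 10.59, and both the minimum (1.97) and the maximum (12.88) fall outside them.

```python
import pandas as pd

def iqr_outliers(df, column, k=1.5):
    """Return the rows whose value in `column` lies outside the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < lower) | (df[column] > upper)]

# Hypothetical sample: the five rows shown by head(), not the full dataset
basic = "Cost Per Month - Basic ($)"
sample = pd.DataFrame({
    "Country": ["Argentina", "Australia", "Austria", "Belgium", "Bolivia"],
    basic: [3.74, 7.84, 9.03, 10.16, 7.99],
})

outliers = iqr_outliers(sample, basic)
print(outliers)  # on this sample only Argentina (3.74) is flagged
```

A flagged price is not necessarily an error. Subscription fees genuinely differ between countries, so a reasonable choice is to keep such rows and note them in a comment rather than drop them.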

## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [9]:
# Drop Country_code: the Country column already identifies each row
mv_df1 = mv_df1.drop(['Country_code'], axis=1)
mv_df1.head(5)

|   | Country | Total Library Size | No. of TV Shows | No. of Movies | Cost Per Month - Basic ($) | Cost Per Month - Standard ($) | Cost Per Month - Premium ($) |
|---|---|---|---|---|---|---|---|
| 0 | Argentina | 4760 | 3154 | 1606 | 3.74 | 6.3 | 9.26 |
| 1 | Australia | 6114 | 4050 | 2064 | 7.84 | 12.12 | 16.39 |
| 2 | Austria | 5640 | 3779 | 1861 | 9.03 | 14.67 | 20.32 |
| 3 | Belgium | 4990 | 3374 | 1616 | 10.16 | 15.24 | 20.32 |
| 4 | Bolivia | 4991 | 3155 | 1836 | 7.99 | 10.99 | 13.99 |
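
Besides identifier columns, two other kinds of unnecessary data are worth testing for: columns that hold a single constant value, and columns that can be derived from others. The sketch below runs both checks on the five rows shown by `head()`; the `sample` frame is an assumption for illustration, and the same expressions would need to be run on `mv_df1` to confirm the result for all 65 rows.

```python
import pandas as pd

# Hypothetical sample: the five rows shown by head()
sample = pd.DataFrame({
    "Country": ["Argentina", "Australia", "Austria", "Belgium", "Bolivia"],
    "Total Library Size": [4760, 6114, 5640, 4990, 4991],
    "No. of TV Shows": [3154, 4050, 3779, 3374, 3155],
    "No. of Movies": [1606, 2064, 1861, 1616, 1836],
})

# Columns with a single distinct value carry no information
constant_cols = [c for c in sample.columns if sample[c].nunique() <= 1]

# A column that always equals the sum of two others is derivable from them
is_sum = (
    sample["No. of TV Shows"] + sample["No. of Movies"]
    == sample["Total Library Size"]
).all()

print("Constant columns:", constant_cols)
print("Total Library Size == TV Shows + Movies:", is_sum)
```

In these rows the library size equals shows plus movies, so the total is redundant in the strict sense. Keeping it is still defensible because it is convenient for plotting.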


In [14]:
# Drop rows with missing values (isnull() found none, so the row count stays at 65)
mv_df1 = mv_df1.dropna()
mv_df1.count()

Country                          65
Total Library Size               65
No. of TV Shows                  65
No. of Movies                    65
Cost Per Month - Basic ($)       65
Cost Per Month - Standard ($)    65
Cost Per Month - Premium ($)     65
dtype: int64

In [15]:
# Drop duplicate rows
mv_df1 = mv_df1.drop_duplicates()
mv_df1.head(5)

|   | Country | Total Library Size | No. of TV Shows | No. of Movies | Cost Per Month - Basic ($) | Cost Per Month - Standard ($) | Cost Per Month - Premium ($) |
|---|---|---|---|---|---|---|---|
| 0 | Argentina | 4760 | 3154 | 1606 | 3.74 | 6.3 | 9.26 |
| 1 | Australia | 6114 | 4050 | 2064 | 7.84 | 12.12 | 16.39 |
| 2 | Austria | 5640 | 3779 | 1861 | 9.03 | 14.67 | 20.32 |
| 3 | Belgium | 4990 | 3374 | 1616 | 10.16 | 15.24 | 20.32 |
| 4 | Bolivia | 4991 | 3155 | 1836 | 7.99 | 10.99 | 13.99 |
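
Calling `drop_duplicates()` directly hides whether anything was actually removed. A small sketch of counting duplicates first is shown below; the three-row `sample` with a deliberately repeated Argentina row is hypothetical and only serves to make the counts non-zero.

```python
import pandas as pd

# Hypothetical sample with one row repeated on purpose
sample = pd.DataFrame({
    "Country": ["Argentina", "Australia", "Argentina"],
    "Cost Per Month - Basic ($)": [3.74, 7.84, 3.74],
})

# Fully identical rows (every column matches an earlier row)
n_dupes = sample.duplicated().sum()

# Countries that appear more than once, even if other columns differ
repeated = sample.loc[sample["Country"].duplicated(keep=False), "Country"].unique()

deduped = sample.drop_duplicates()
print(f"{n_dupes} duplicate row(s) removed, {len(deduped)} rows remain")
```

Checking the `Country` column separately matters because two rows for the same country with different prices would not be caught by a whole-row comparison.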


## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [12]:
# Check for capitalization inconsistencies in the country names
MV = mv_df1['Country'].unique()
# Sort them alphabetically to make near-duplicates easier to spot
MV.sort()

MV

array(['Argentina', 'Australia', 'Austria', 'Belgium', 'Bolivia',
       'Brazil', 'Bulgaria', 'Canada', 'Chile', 'Colombia', 'Costa Rica',
       'Croatia', 'Czechia', 'Denmark', 'Ecuador', 'Estonia', 'Finland',
       'France', 'Germany', 'Gibraltar', 'Greece', 'Guatemala',
       'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Ireland', 'Israel', 'Italy', 'Japan', 'Latvia',
       'Liechtenstein', 'Lithuania', 'Malaysia', 'Mexico', 'Moldova',
       'Monaco', 'Netherlands', 'New Zealand', 'Norway', 'Paraguay',
       'Peru', 'Philippines', 'Poland', 'Portugal', 'Romania', 'Russia',
       'San Marino', 'Singapore', 'Slovakia', 'South Africa',
       'South Korea', 'Spain', 'Sweden', 'Switzerland', 'Taiwan',
       'Thailand', 'Turkey', 'Ukraine', 'United Kingdom', 'United States',
       'Uruguay', 'Venezuela'], dtype=object)
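
Reading the sorted list works for 65 names, and here every name appears once with consistent capitalization. For longer lists, the same check can be automated by comparing each name with a normalized version of itself. The sketch below is one way to do that under stated assumptions: the `countries` series contains made-up inconsistent spellings that do not occur in this dataset.

```python
import pandas as pd

# Hypothetical values: the variant spellings are invented for illustration
countries = pd.Series(
    ["Argentina", "argentina ", "Hong Kong", "hong kong", "Japan"]
)

# Normalize by trimming whitespace and lowercasing
normalized = countries.str.strip().str.lower()

# A normalized key with more than one raw spelling signals an inconsistency
spellings_per_key = countries.groupby(normalized).nunique()
inconsistent_keys = spellings_per_key[spellings_per_key > 1].index.tolist()

print(inconsistent_keys)
```

If the list comes back empty, as it would be expected to for the country names shown above, no capitalization or whitespace fix is needed.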

In [17]:
# Rename the columns
mv_df1 = mv_df1.rename(columns={
    "Total Library Size": "Libraries", "No. of TV Shows": "Shows", "No. of Movies": "Movies",
    "Cost Per Month - Basic ($)": "Basic ($)", "Cost Per Month - Standard ($)": "Standard ($)",
    "Cost Per Month - Premium ($)": "Premium ($)",
})
mv_df1.head(5)

|   | Country | Libraries | Shows | Movies | Basic ($) | Standard ($) | Premium ($) |
|---|---|---|---|---|---|---|---|
| 0 | Argentina | 4760 | 3154 | 1606 | 3.74 | 6.3 | 9.26 |
| 1 | Australia | 6114 | 4050 | 2064 | 7.84 | 12.12 | 16.39 |
| 2 | Austria | 5640 | 3779 | 1861 | 9.03 | 14.67 | 20.32 |
| 3 | Belgium | 4990 | 3374 | 1616 | 10.16 | 15.24 | 20.32 |
| 4 | Bolivia | 4991 | 3155 | 1836 | 7.99 | 10.99 | 13.99 |
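
Inconsistent data can also mean values that contradict each other across columns. For this dataset, one plausible rule is that the Basic plan should never cost more than Standard, and Standard never more than Premium. The sketch below tests that rule on the five rows shown above, using the renamed columns; the `sample` frame and the rule itself are assumptions, not something the dataset documentation states.

```python
import pandas as pd

# Hypothetical sample: the five rows shown by head() after renaming
sample = pd.DataFrame({
    "Country": ["Argentina", "Australia", "Austria", "Belgium", "Bolivia"],
    "Basic ($)": [3.74, 7.84, 9.03, 10.16, 7.99],
    "Standard ($)": [6.3, 12.12, 14.67, 15.24, 10.99],
    "Premium ($)": [9.26, 16.39, 20.32, 20.32, 13.99],
})

# True where the three tiers are in non-decreasing price order
ordered = (sample["Basic ($)"] <= sample["Standard ($)"]) & (
    sample["Standard ($)"] <= sample["Premium ($)"]
)

# Rows that break the rule would need a closer look
violations = sample[~ordered]
print(f"{len(violations)} row(s) with out-of-order plan prices")
```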


In [None]:
1. Yes.
2. Yes, it showed me how to make the data consistent and useful.
3. Observe the data carefully so that nothing important is missed,
and distinguish relevant data from irrelevant (junk) data.


## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?
2. Did the process of cleaning your data give you new insights into your dataset?
3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations?