# Data Cleaning - Basics

This notebook references the following tutorials:
*  Alex The Analyst's YouTube [Data Cleaning in Pandas | Python Pandas Tutorials](https://youtu.be/bDhvCp3_lYw?si=4IaQlklqX3srBNba)
*  Learn with Ankith's YouTube - [Data Cleaning/Data Preprocessing Before Building a Model - A Comprehensive Guide](https://youtu.be/GP-2634exqA?si=fmqFTzw5lsTfIadQ)
*  ChatGPT [query it to make your own DC tutorial](https://chat.openai.com/auth/login)

We are going to accomplish the following tasks in this notebook:
*  Import the dataset into Google Colab from the desktop
*  Import data handling libraries
*  Import the dataset into pandas
*  Display the number of rows and columns of the dataset
*  Display the first 5 rows of the dataset
*  Display descriptive statistics for the dataset
*  Display the number of null values in each column of the dataset
*  Display the number of duplicate rows in the dataset
*  Remove duplicate rows from the dataset
*  Display again the number of rows and columns of the dataset

We can drag and drop the data files (CSV files) that we want to work with from our local drive onto the Google Colab file icon (left side of the Colab screen):
*  Download the [Life Expectancy (WHO) dataset](https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who) from Kaggle to your desktop; the file is named `Life Expectancy Data.csv`
*  Click the folder icon about halfway down the left side of the Colab screen
*  Drag and drop the `Life Expectancy Data.csv` file into the folder to upload it to Google Colab from your desktop
*  Repeat this upload every time you open the notebook, because Colab does not keep uploaded files between sessions

Our first script loads the following libraries:

*  [Matplotlib](https://matplotlib.org/) version 3.8.2, dated 17 November 2023
*  [NumPy](https://numpy.org/) version 1.26.0, dated 16 September 2023
*  [Pandas](https://pandas.pydata.org/) version 2.1.4, dated 8 December 2023
*  [Seaborn](https://seaborn.pydata.org/) version 0.13.1, dated January 2024

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

Our next script:
*  Uses the Pandas library
*  Loads the data
*  Uses the [.shape attribute](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) (a property, so it is read without parentheses)
*  Returns a tuple representing the dimensionality (rows, columns) of the DataFrame

In [3]:
# Load data
life_data = pd.read_csv('Life Expectancy Data.csv')

# Display the number of rows and columns comprising the dataset
life_data.shape

(2938, 22)
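
Because `.shape` returns a plain tuple, the row and column counts can be unpacked into separate variables. The sketch below uses a small made-up DataFrame named `toy` (an assumption for illustration, not the life expectancy data):

```python
import pandas as pd

# Hypothetical toy frame standing in for life_data
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Year": [2013, 2014, 2015],
})

# .shape is an attribute, so it is read without parentheses
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")
```

The same unpacking should work on `life_data` once the CSV file has been loaded.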

This script:
*  Uses the Pandas library
*  Uses the [.head method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)
*  Displays the first n rows of the dataset
*  n = 5 is the default

In [4]:
# Display the first five rows of the dataset
life_data.head()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
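
To see a different number of rows, pass `n` to `.head`; the companion `.tail` method shows the last rows instead. A minimal sketch on a made-up frame (`toy` is a hypothetical name, not part of the dataset):

```python
import pandas as pd

# Hypothetical toy frame with ten rows
toy = pd.DataFrame({"Year": range(2000, 2010)})

default_head = toy.head()   # first 5 rows (the default)
first_three = toy.head(3)   # first 3 rows
last_two = toy.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```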


Our next script:
*  Uses the Pandas library
*  Uses the [.describe method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
*  Generates descriptive statistics
*  Descriptive statistics summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values



In [5]:
# Display descriptive statistics about the dataset
life_data.describe()

Unnamed: 0,Year,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
count,2938.0,2928.0,2928.0,2938.0,2744.0,2938.0,2385.0,2938.0,2904.0,2938.0,2919.0,2712.0,2919.0,2938.0,2490.0,2286.0,2904.0,2904.0,2771.0,2775.0
mean,2007.51872,69.224932,164.796448,30.303948,4.602861,738.251295,80.940461,2419.59224,38.321247,42.035739,82.550188,5.93819,82.324084,1.742103,7483.158469,12753380.0,4.839704,4.870317,0.627551,11.992793
std,4.613841,9.523867,124.292079,117.926501,4.052413,1987.914858,25.070016,11467.272489,20.044034,160.445548,23.428046,2.49832,23.716912,5.077785,14270.169342,61012100.0,4.420195,4.508882,0.210904,3.35892
min,2000.0,36.3,1.0,0.0,0.01,0.0,1.0,0.0,1.0,0.0,3.0,0.37,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,0.0
25%,2004.0,63.1,74.0,0.0,0.8775,4.685343,77.0,0.0,19.3,0.0,78.0,4.26,78.0,0.1,463.935626,195793.2,1.6,1.5,0.493,10.1
50%,2008.0,72.1,144.0,3.0,3.755,64.912906,92.0,17.0,43.5,4.0,93.0,5.755,93.0,0.1,1766.947595,1386542.0,3.3,3.3,0.677,12.3
75%,2012.0,75.7,228.0,22.0,7.7025,441.534144,97.0,360.25,56.2,28.0,97.0,7.4925,97.0,0.8,5910.806335,7420359.0,7.2,7.2,0.779,14.3
max,2015.0,89.0,723.0,1800.0,17.87,19479.91161,99.0,212183.0,87.3,2500.0,99.0,17.6,99.0,50.6,119172.7418,1293859000.0,27.7,28.6,0.948,20.7
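
By default `.describe` summarizes only the numeric columns, which is why `Country` and `Status` do not appear in the output above, and `count` reports non-null values only. The sketch below illustrates both points on a made-up frame (`toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column
toy = pd.DataFrame({
    "Status": ["Developing", "Developed", "Developing", "Developing"],
    "Alcohol": [0.01, 9.5, np.nan, 4.0],
})

# Default: numeric columns only; NaN is skipped, so count is 3, not 4
numeric_summary = toy.describe()

# include="all" adds text columns (unique, top, freq)
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```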


Our next script:
* Uses the Pandas library
* Uses the [.isnull](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html) method, which marks each cell as True (null) or False
* Uses the [.sum](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html) method, which adds up the True values column by column
* Displays the number of null values in each column of the dataset

In [9]:
# Count the null values in each column
life_data.isnull().sum()

Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
 BMI                                34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
 HIV/AIDS                            0
GDP                                448
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64
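
Raw counts are easier to judge next to the share of rows they represent. One possible way to get both, sketched on a made-up frame (`toy`, `report` and the values are assumptions for illustration), is to take the mean of the boolean mask:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing values
toy = pd.DataFrame({
    "GDP": [584.3, np.nan, 631.7, np.nan],
    "Schooling": [10.1, 10.0, np.nan, 9.8],
    "Year": [2015, 2014, 2013, 2012],
})

null_counts = toy.isnull().sum()          # nulls per column
null_percent = toy.isnull().mean() * 100  # share of rows that are null

# Combine both into one table, worst columns first
report = pd.DataFrame({"nulls": null_counts, "percent": null_percent})
report = report.sort_values("nulls", ascending=False)
print(report)
```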

Our next script:
* Uses the Pandas library
* Uses the [.duplicated](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html) method, which marks each row that repeats an earlier row as True
* Uses the [.sum](https://pandas.pydata.org/docs/reference/api/pandas.Series.sum.html) method to add up the True values
* Displays the number of duplicate rows

In [12]:
# Count the number of duplicate rows
life_data.duplicated().sum()

0
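
Since the life expectancy data contains no duplicates, the behaviour of `.duplicated` is easier to see on a small made-up frame (`toy` is a hypothetical name). The sketch also shows the `keep` and `subset` parameters:

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are identical
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "A"],
    "Year": [2015, 2015, 2015, 2014],
})

# Default keep="first": only the later copy is flagged
dup_mask = toy.duplicated()
n_dups = dup_mask.sum()

# keep=False flags every member of a duplicate group
n_dups_all = toy.duplicated(keep=False).sum()

# subset restricts the comparison to the listed columns
n_dup_countries = toy.duplicated(subset=["Country"]).sum()

print(n_dups, n_dups_all, n_dup_countries)
```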

Our next script:
* Uses the Pandas library
* Uses the [.drop_duplicates](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) method
* Returns a new DataFrame with duplicate rows removed, so the result must be assigned back to a variable
* Displays the number of rows and columns comprising the dataset


In [13]:
# Remove duplicate rows from the dataset
life_data = life_data.drop_duplicates()

# Display the number of rows and columns comprising the dataset
life_data.shape

(2938, 22)
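
The shape is unchanged here because the dataset has no duplicate rows. Note that `.drop_duplicates` returns a new DataFrame and leaves the original untouched unless the result is assigned (or `inplace=True` is passed). A minimal sketch on a made-up frame (`toy` and `deduped` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are identical
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2015, 2015, 2015],
})

# Calling the method without assigning discards the result
toy.drop_duplicates()
shape_unchanged = toy.shape

# Assign the result to keep it; reset_index renumbers the rows
deduped = toy.drop_duplicates().reset_index(drop=True)

print(shape_unchanged)
print(deduped.shape)
```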