### Data Analysis

In data analytics and data science, once you have gathered the data you want to analyse, the first thing to do is clean it. This step can take a large share of your time, with up to 80% often quoted. Data cleaning is part of a bigger process called data wrangling.

Data cleaning means making sure your data is in the right format for your analysis. The process includes some of the following:
1. Checking for null values
2. Deciding what to do with null values (replace or drop)
3. Checking for duplicates
4. Checking that the columns are in the right datatypes (integer, string, datetime)
5. Separating the data into smaller sets if needed

Remember that each dataset is different, and the steps you take to clean one will most likely differ from the steps you take to clean another.
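The checklist above can be sketched on a tiny made-up table. The column names (`id`, `score`, `joined`) and the values are hypothetical and only illustrate the order of the steps. Filling missing scores with 0.0 is an arbitrary choice for the example, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: missing scores and one repeated row
raw_df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "score": [10.0, np.nan, np.nan, 7.5],
    "joined": ["2021-01-05", "2021-02-10", "2021-02-10", "2021-03-15"],
})

# 1. check for null values in each column
null_counts = raw_df.isnull().sum()

# 2. decide what to do with them (here: replace; dropna() would drop the rows)
clean_df = raw_df.fillna({"score": 0.0})

# 3. check for duplicates, then remove them
duplicate_count = clean_df.duplicated().sum()
clean_df = clean_df.drop_duplicates()

# 4. put columns in the right datatype (text -> datetime)
clean_df = clean_df.assign(joined=pd.to_datetime(clean_df["joined"]))

# 5. separate the data into a smaller set
high_df = clean_df[clean_df["score"] > 5]

print(null_counts)
print(duplicate_count)
print(clean_df.dtypes)
```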


To help view and clean the data, we are going to use the pandas package. To install it, do the following:
1. Open a command prompt
2. Run the command:
"python -m pip install pandas"



In [44]:
"""
It is common practice to import all necessary packages to work with at the beginning of your code
"""
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt

##### Datasets

For this and the next set of notes, we will primarily use two datasets, though others will be introduced as well. The two datasets are:
a. diabetes.csv
b. daily-bike-share.csv

You can find other datasets to practise with on websites like Kaggle. Data can be stored in various formats such as csv, txt and xlsx (Excel). Knowing the format of your data is important so that you can write the right code to read and process it; otherwise you will run into errors 😖😩

#### Reading the data

In [45]:
"""
In pandas, there are various ways of reading data depending on its format. When the data is read, it is stored in what we call a DataFrame, which consists of rows and
columns like you would see in an Excel sheet. Our dataset is in csv format, which is relatively easy to read because the values in each row are separated by commas (,)
"""

# reading csv file using pandas
diabetes_df = pd.read_csv("diabetes.csv")
bikeinfo_df = pd.read_csv("daily-bike-share.csv")

# a file whose values are separated by a different symbol (for example |) can also be read with read_csv, but the separator must be passed through the sep argument


type(diabetes_df)

pandas.core.frame.DataFrame
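As a small sketch of the separator comment above: the text below stands in for a pipe-separated file, and its contents are made up for illustration. `io.StringIO` is used only so the example runs without a file on disk; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical pipe-separated content standing in for a file on disk
pipe_text = "PatientID|Age|BMI\n1|21|43.5\n2|23|21.2\n"

# sep tells read_csv which character separates the columns
pipe_df = pd.read_csv(io.StringIO(pipe_text), sep="|")

print(pipe_df)
print(pipe_df.shape)
```

The same `sep` argument also covers tab-separated text files (`sep="\t"`).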

In [46]:
# showing the contents of the diabetes_df
diabetes_df

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0
3,1883350,9,103,78,25,304,29.582192,1.282870,43,1
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0
...,...,...,...,...,...,...,...,...,...,...
14995,1490300,10,65,60,46,177,33.512468,0.148327,41,1
14996,1744410,2,73,66,27,168,30.132636,0.862252,38,1
14997,1742742,0,93,89,43,57,18.690683,0.427049,24,0
14998,1099353,0,132,98,18,161,19.791645,0.302257,23,0


In [47]:
# a good way to get a first look at your dataset is to view the top rows of the dataframe

diabetes_df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0
3,1883350,9,103,78,25,304,29.582192,1.28287,43,1
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0


In [48]:
# it is also useful to look at the bottom rows of the dataframe
diabetes_df.tail()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
14995,1490300,10,65,60,46,177,33.512468,0.148327,41,1
14996,1744410,2,73,66,27,168,30.132636,0.862252,38,1
14997,1742742,0,93,89,43,57,18.690683,0.427049,24,0
14998,1099353,0,132,98,18,161,19.791645,0.302257,23,0
14999,1386396,3,114,65,47,512,36.215437,0.147363,34,1


In [49]:
# we look at the shape of the dataset to determine the number of rows and columns
diabetes_df.shape

(15000, 10)

In [50]:
# checking the column datatypes
diabetes_df.dtypes

PatientID                   int64
Pregnancies                 int64
PlasmaGlucose               int64
DiastolicBloodPressure      int64
TricepsThickness            int64
SerumInsulin                int64
BMI                       float64
DiabetesPedigree          float64
Age                         int64
Diabetic                    int64
dtype: object

In [51]:
# checking for null values
diabetes_df.isnull()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
14995,False,False,False,False,False,False,False,False,False,False
14996,False,False,False,False,False,False,False,False,False,False
14997,False,False,False,False,False,False,False,False,False,False
14998,False,False,False,False,False,False,False,False,False,False


In [52]:
# unlike the command above, adding sum() totals the null values and shows the count for each column
diabetes_df.isnull().sum()

PatientID                 0
Pregnancies               0
PlasmaGlucose             0
DiastolicBloodPressure    0
TricepsThickness          0
SerumInsulin              0
BMI                       0
DiabetesPedigree          0
Age                       0
Diabetic                  0
dtype: int64
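The diabetes dataset has no null values, so the "replace or drop" decision does not come up here. A minimal sketch of both options on a made-up table is shown below; the values are hypothetical, and filling with the column mean is just one possible replacement strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
toy_df = pd.DataFrame({"Age": [21, np.nan, 43], "BMI": [43.5, 21.2, np.nan]})

# Option 1: drop every row that contains at least one null value
dropped_df = toy_df.dropna()

# Option 2: replace each null value with the mean of its column
filled_df = toy_df.fillna(toy_df.mean())

print(dropped_df)
print(filled_df)
```

Both calls return a new dataframe and leave `toy_df` unchanged.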

In [53]:
# creating a new column and assigning zero to all rows
diabetes_df["Status"] = 0

In [54]:
# checking to confirm that column was created
diabetes_df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Status
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0,0
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0,0
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0,0
3,1883350,9,103,78,25,304,29.582192,1.28287,43,1,0
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0,0


In [55]:
# checking the datatype of the new column created
diabetes_df["Status"].dtype

# output will show 'int64', which signifies integer; 'O' would signify object, which is used to represent strings

dtype('int64')

In [56]:
# converting the datatype of the column from integer to string
diabetes_df["Status"] = diabetes_df["Status"].astype('str')

In [57]:
# checking the datatype of the new column after converting the datatype to string
diabetes_df["Status"].dtype

dtype('O')
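The same idea applies to the other datatypes mentioned earlier. The sketch below converts text columns to integer and datetime; the table and its column names (`visit_date`, `visits`) are hypothetical and not part of the diabetes dataset.

```python
import pandas as pd

# Hypothetical data where every column was read in as text
visits_df = pd.DataFrame({
    "visit_date": ["2023-01-15", "2023-02-20"],
    "visits": ["3", "5"],
})

# text -> datetime, which makes date parts such as the month available
visits_df["visit_date"] = pd.to_datetime(visits_df["visit_date"])

# text -> integer, so that arithmetic works as expected
visits_df["visits"] = visits_df["visits"].astype("int64")

print(visits_df.dtypes)
print(visits_df["visits"].sum())
```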

##### Loc & iLoc

With pandas there is a convenient way to locate values in your dataframe using row and column labels or positions. There are two indexers for this, called loc and iloc.

loc: selects by label, so column names (and row index labels) are used; loc[rows,columns]. A slice includes its end label.
iloc: selects by integer position only; iloc[rows,columns]. A slice excludes its end position.

In [61]:
#loc example
diabetes_df.loc[0:2,'PlasmaGlucose']

0    171
1     92
2    115
Name: PlasmaGlucose, dtype: int64

In [65]:
# an iloc example
diabetes_df.iloc[0:4,0:2]

Unnamed: 0,PatientID,Pregnancies
0,1354778,0
1,1147438,8
2,1640031,7
3,1883350,9
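The difference is easier to see when the row labels are not numbers. In the sketch below the index labels `a` to `d` and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical data with text labels as the row index
demo_df = pd.DataFrame(
    {"Age": [21, 23, 43, 22], "BMI": [43.5, 21.2, 29.6, 42.6]},
    index=["a", "b", "c", "d"],
)

# loc works with labels, and the slice includes the end label "c"
by_label = demo_df.loc["a":"c", "Age"]

# iloc works with positions, and the slice stops before position 2
by_position = demo_df.iloc[0:2, 0]

print(by_label)
print(by_position)
```

This is also why `loc[0:2, ...]` returned three rows in the example above while `iloc[0:4, ...]` returned four.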


The code below categorises our data into old and young based on the age of the patient. Any patient older than 50 will be considered old, while everyone else (50 or younger) will be considered young.

In [60]:
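# go through every row label (this relies on the default 0, 1, 2, ... index)
for row in range(len(diabetes_df)):
    # compare the patient's age and write the category into the Status column
    if diabetes_df.loc[row,"Age"] > 50:
        diabetes_df.loc[row,"Status"] = "Old"
    else:
        diabetes_df.loc[row,"Status"] = "young"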
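The row-by-row loop is easy to follow, but on large dataframes it can be slow. One possible alternative, shown here on a small made-up table rather than the diabetes data, is `np.where`, which applies the same rule to the whole column at once.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the boundary value 50
ages_df = pd.DataFrame({"Age": [21, 55, 50, 77]})

# "Old" where Age is above 50, "young" everywhere else
ages_df["Status"] = np.where(ages_df["Age"] > 50, "Old", "young")

print(ages_df)
```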

In [33]:
diabetes_df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Status
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0,young
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0,young
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0,young
3,1883350,9,103,78,25,304,29.582192,1.28287,43,1,Old
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0,young


In [34]:
diabetes_df.tail()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Status
14995,1490300,10,65,60,46,177,33.512468,0.148327,41,1,Old
14996,1744410,2,73,66,27,168,30.132636,0.862252,38,1,young
14997,1742742,0,93,89,43,57,18.690683,0.427049,24,0,young
14998,1099353,0,132,98,18,161,19.791645,0.302257,23,0,young
14999,1386396,3,114,65,47,512,36.215437,0.147363,34,1,young


In [78]:
"""
adding more categories to the status column
"""
for row in range(len(diabetes_df)):
    if diabetes_df.loc[row,"Age"] >= 50:
        diabetes_df.loc[row,"Status"] = "Old"
    # the < 20 check must come before the catch-all, otherwise it is never reached
    elif diabetes_df.loc[row,"Age"] < 20:
        diabetes_df.loc[row,"Status"] = "Young"
    else:
        diabetes_df.loc[row,"Status"] = "Youth"
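With more than two categories, `pd.cut` is one way to express the age bands without a chain of conditions. The sketch below uses made-up ages and assumes the bands are under 20 for "Young", 20 to 49 for "Youth", and 50 and above for "Old".

```python
import pandas as pd

# Hypothetical ages chosen to hit every band and both boundaries
bands_df = pd.DataFrame({"Age": [18, 20, 49, 50, 77]})

# right=False makes each band include its lower edge: [0, 20), [20, 50), [50, inf)
bands_df["Status"] = pd.cut(
    bands_df["Age"],
    bins=[0, 20, 50, float("inf")],
    labels=["Young", "Youth", "Old"],
    right=False,
).astype(str)

print(bands_df)
```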

#### Segmenting the data

With a large dataset, there are times when it is better to break it down into smaller datasets for closer investigation. You can break a dataset down in multiple ways; for this diabetes dataset, we will split it into new dataframes based on the Status category.

In [80]:
# creating 3 new dataframes based on the status category
old_df = diabetes_df[diabetes_df["Status"].str.lower()=="old"]
young_df = diabetes_df[diabetes_df["Status"].str.lower()=="young"]
youth_df = diabetes_df[diabetes_df["Status"].str.lower()=="youth"]


In [81]:
print(f"The shape of the data for old category is {old_df.shape}")
print(f"The shape of the data for young category is {young_df.shape}")
print(f"The shape of the data for youth category is {youth_df.shape}")

The shape of the data for old category is (1446, 11)
The shape of the data for young category is (0, 11)
The shape of the data for youth category is (13554, 11)
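When there are many categories, writing one filter per category becomes repetitive. A possible alternative is `groupby`, sketched here on a made-up table; the dictionary name `segments` is just an illustrative choice.

```python
import pandas as pd

# Hypothetical data that already has a Status column
status_df = pd.DataFrame({
    "Age": [21, 55, 34, 62],
    "Status": ["Youth", "Old", "Youth", "Old"],
})

# one dataframe per category, keyed by the category name
segments = {name: group for name, group in status_df.groupby("Status")}

# number of rows in each category
sizes = status_df["Status"].value_counts()

for name, group in segments.items():
    print(f"The shape of the data for {name} category is {group.shape}")
```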


In [43]:
# counting the rows that are exact duplicates of an earlier row
diabetes_df.duplicated().sum()

0

In [44]:
# drop_duplicates returns a new dataframe; assign the result if you want to keep it
diabetes_df.drop_duplicates()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Status
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0,young
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0,young
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0,young
3,1883350,9,103,78,25,304,29.582192,1.282870,43,1,Old
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0,young
...,...,...,...,...,...,...,...,...,...,...,...
14995,1490300,10,65,60,46,177,33.512468,0.148327,41,1,Old
14996,1744410,2,73,66,27,168,30.132636,0.862252,38,1,young
14997,1742742,0,93,89,43,57,18.690683,0.427049,24,0,young
14998,1099353,0,132,98,18,161,19.791645,0.302257,23,0,young
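Since the diabetes data has no duplicates, the effect of `drop_duplicates` is not visible above. The sketch below uses a made-up table with one repeated row to show that the original dataframe stays as it was unless the result is assigned.

```python
import pandas as pd

# Hypothetical data where the last row repeats the second one
dup_df = pd.DataFrame({"PatientID": [1, 2, 2], "Age": [21, 23, 23]})

# number of rows that repeat an earlier row
n_duplicates = dup_df.duplicated().sum()

# the cleaned result is a new dataframe; dup_df itself is not modified
deduped_df = dup_df.drop_duplicates()

print(n_duplicates)
print(len(dup_df), len(deduped_df))
```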


In [46]:
# listing the distinct values in the Status column
diabetes_df["Status"].unique()

array(['young', 'Old'], dtype=object)

In [47]:
# the same works for numeric columns
diabetes_df["PlasmaGlucose"].unique()

array([171,  92, 115, 103,  85,  82, 133,  67,  80,  72,  88,  94, 114,
       110, 148, 109, 106, 156, 117, 102, 118, 124,  44, 104, 135, 163,
       119,  70,  75, 152, 149, 123,  74,  45, 108,  76, 165,  73, 105,
        50, 134,  90, 121, 125,  78,  56,  87, 100, 137,  91,  79, 138,
       157,  68,  77,  71,  69, 144,  57,  93,  55,  84,  47, 116,  97,
       127, 167, 107,  96, 151, 143, 132,  83, 120, 188, 140,  54, 170,
        89, 128, 145, 112, 122, 169, 131,  99,  98, 139,  58, 153, 168,
       173, 176, 154, 113,  59,  81, 174, 146,  53, 141, 172, 129, 142,
        86, 178, 101, 166, 155, 183, 177,  95,  62, 147, 111, 136,  51,
        63, 150,  65, 175,  52, 126,  46,  66, 162,  49, 160, 180, 159,
        48, 182, 179, 192, 130, 158, 164, 181, 190, 187,  61,  60, 161,
       185, 186, 189, 184, 191,  64], dtype=int64)

In [50]:
# summary statistics for the numeric columns
diabetes_df.describe()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
count,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0
mean,1502922.0,3.224533,107.856867,71.220667,28.814,137.852133,31.509646,0.398968,30.137733,0.333333
std,289253.4,3.39102,31.981975,16.758716,14.555716,133.068252,9.759,0.377944,12.089703,0.47142
min,1000038.0,0.0,44.0,24.0,7.0,14.0,18.200512,0.078044,21.0,0.0
25%,1252866.0,0.0,84.0,58.0,15.0,39.0,21.259887,0.137743,22.0,0.0
50%,1505508.0,2.0,104.0,72.0,31.0,83.0,31.76794,0.200297,24.0,0.0
75%,1755205.0,6.0,129.0,85.0,41.0,195.0,39.259692,0.616285,35.0,1.0
max,1999997.0,14.0,192.0,117.0,93.0,799.0,56.034628,2.301594,77.0,1.0
