### Data Analysis

In data analytics and data science, once you have gathered the data you want to analyse, the first thing to do is clean it. This step can take up to 80% of your time. Data cleaning is part of a bigger process called data wrangling.

Data cleaning means making sure your data is in the right format for your analysis. The process includes some of the following:
1. Checking for null values
2. Deciding what to do with null values (replace or drop)
3. Checking for duplicates
4. Checking that the columns are in the right datatypes (integer, string, datetime)
5. Separating the data into various smaller sets if needed

Remember that each dataset is different: the steps you take to clean one will most likely differ from the steps you take to clean another.
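The checklist above can be sketched end to end on a tiny dataframe. This is only an illustration: `toy_df` and its values are made up for this example and are not one of the datasets used in these notes, and the choices made at each step (dropping rather than replacing nulls, for instance) are assumptions that would change from dataset to dataset.

```python
import pandas as pd

# Small made-up dataframe (hypothetical, not one of the course datasets)
toy_df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "age": ["25", "31", "31", None],   # numbers stored as strings, one value missing
    "visit": ["2021-03-01", "2021-03-02", "2021-03-02", "2021-03-05"],
})

# 1. count the null values in each column
null_counts = toy_df.isnull().sum()

# 2. decide what to do with nulls: here rows with a missing age are dropped
clean_df = toy_df.dropna(subset=["age"])

# 3. remove duplicate rows
clean_df = clean_df.drop_duplicates()

# 4. put the columns in the right datatypes
clean_df = clean_df.astype({"age": "int64"})
clean_df["visit"] = pd.to_datetime(clean_df["visit"])

# 5. separate the data into a smaller set if needed
over_30_df = clean_df[clean_df["age"] > 30]

print(null_counts)
print(clean_df.dtypes)
```

Each step returns a new dataframe, so the original `toy_df` stays untouched and you can compare before and after.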


To help view and clean the data we are going to use the pandas package. To install it do the following:
1. Open a command prompt
2. Run the command:
"python -m pip install pandas"



In [2]:
"""
It is common practice to import all necessary packages to work with at the beginning of your code
"""
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt  # the plotting functions live in matplotlib.pyplot

##### Datasets

For this and the next set of notes, we will primarily use two datasets, though others will be introduced as well. The two datasets are:
a. diabetes.csv
b. daily-bike-share.csv

You can find other datasets to practise on at websites such as Kaggle. Data can be stored in various formats such as csv, txt and xlsx (Excel). Knowing the format of your data is important so that you can write the right code to read and process it; otherwise you will run into errors 😖😩

#### Reading the data

In [3]:
"""
In pandas, there are various ways of reading data depending on its format. When the data is read, it is stored in what we call a DataFrame, which consists of rows and
columns like you would see in an Excel sheet. Our dataset is in csv format, which is relatively easy to read as the columns are separated by commas (,)
"""

# reading csv files using pandas; forward slashes in the path also work on Windows and avoid backslash escape sequences such as "\d"
diabetes_df = pd.read_csv("Datasets/diabetes.csv")
bikeinfo_df = pd.read_csv("Datasets/daily-bike-share.csv")

# a file whose values are separated by a different symbol (for example |) can be read with the same function, but the separator must be passed through the sep argument


type(diabetes_df)

pandas.core.frame.DataFrame
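The comment in the cell above mentions files that use a separator other than a comma. A minimal sketch of that case is shown here, assuming a pipe-separated file. To keep the example self-contained, the "file" is a made-up string wrapped in `io.StringIO`; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Made-up pipe-separated text standing in for a file on disk
pipe_text = "PatientID|Age|Diabetic\n1354778|21|0\n1147438|23|0\n"

# sep tells read_csv which character separates the columns
pipe_df = pd.read_csv(io.StringIO(pipe_text), sep="|")

print(pipe_df.shape)
print(pipe_df.columns.tolist())
```

Without `sep="|"`, pandas would look for commas, find none, and read every line as a single column.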

In [4]:
# showing the contents of the diabetes_df
diabetes_df

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0
3,1883350,9,103,78,25,304,29.582192,1.282870,43,1
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0
...,...,...,...,...,...,...,...,...,...,...
14995,1490300,10,65,60,46,177,33.512468,0.148327,41,1
14996,1744410,2,73,66,27,168,30.132636,0.862252,38,1
14997,1742742,0,93,89,43,57,18.690683,0.427049,24,0
14998,1099353,0,132,98,18,161,19.791645,0.302257,23,0


In [5]:
# a good way to look at your dataset is to look at the top rows of the dataframe

diabetes_df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0
3,1883350,9,103,78,25,304,29.582192,1.28287,43,1
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0


In [6]:
# it is also useful to look at the bottom rows of the dataframe
diabetes_df.tail()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
14995,1490300,10,65,60,46,177,33.512468,0.148327,41,1
14996,1744410,2,73,66,27,168,30.132636,0.862252,38,1
14997,1742742,0,93,89,43,57,18.690683,0.427049,24,0
14998,1099353,0,132,98,18,161,19.791645,0.302257,23,0
14999,1386396,3,114,65,47,512,36.215437,0.147363,34,1


In [7]:
# we look at the shape of the dataset to determine the number of rows and columns
diabetes_df.shape

(15000, 10)

In [8]:
# checking the column datatypes
diabetes_df.dtypes

PatientID                   int64
Pregnancies                 int64
PlasmaGlucose               int64
DiastolicBloodPressure      int64
TricepsThickness            int64
SerumInsulin                int64
BMI                       float64
DiabetesPedigree          float64
Age                         int64
Diabetic                    int64
dtype: object

In [9]:
# checking for null values
diabetes_df.isnull()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
14995,False,False,False,False,False,False,False,False,False,False
14996,False,False,False,False,False,False,False,False,False,False
14997,False,False,False,False,False,False,False,False,False,False
14998,False,False,False,False,False,False,False,False,False,False


In [10]:
# unlike the command above, adding sum() counts the null values in each column
diabetes_df.isnull().sum()

PatientID                 0
Pregnancies               0
PlasmaGlucose             0
DiastolicBloodPressure    0
TricepsThickness          0
SerumInsulin              0
BMI                       0
DiabetesPedigree          0
Age                       0
Diabetic                  0
dtype: int64
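The diabetes data has no null values, so there is nothing to replace or drop here. For datasets that do have gaps, the two options from the cleaning list (drop or replace) might look like the sketch below. `gaps_df` is a made-up dataframe, and filling with the column median is just one possible choice, not a rule.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps (hypothetical values)
gaps_df = pd.DataFrame({
    "BMI": [43.5, np.nan, 41.5, 29.6],
    "Age": [21, 23, np.nan, 43],
})

# option 1: drop every row that has at least one null value
dropped_df = gaps_df.dropna()

# option 2: replace nulls, here with the median of each column
filled_df = gaps_df.fillna(gaps_df.median())

print(dropped_df)
print(filled_df)
```

Dropping is simplest but loses rows; replacing keeps every row but introduces values that were never measured, so which one is appropriate depends on the dataset and the analysis.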

In [11]:
# creating a new column and assigning zero to all rows
diabetes_df["Status"] = 0

In [12]:
# checking to confirm that column was created
diabetes_df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Status
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0,0
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0,0
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0,0
3,1883350,9,103,78,25,304,29.582192,1.28287,43,1,0
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0,0


In [13]:
# checking the datatype of the new column created
diabetes_df["Status"].dtype

# the output shows 'int64', which means integer; 'O' would mean object, which is used to represent strings

dtype('int64')

In [14]:
# converting the datatype of the column from integer to string
diabetes_df["Status"] = diabetes_df["Status"].astype('str')

In [15]:
# checking the datatype of the new column after converting the datatype to string
diabetes_df["Status"].dtype

dtype('O')

##### Loc & iLoc

pandas offers a convenient way to locate values in your dataframe using row and column positions or names. There are two indexers for this, called loc and iloc.

loc: selects by label, so column names (and row index labels) are used; loc[rows,columns]. A slice such as 0:2 includes the end label.
iloc: selects by integer position only; iloc[rows,columns]. A slice such as 0:4 excludes the end position.

In [16]:
#loc example
diabetes_df.loc[0:2,'PlasmaGlucose']

0    171
1     92
2    115
Name: PlasmaGlucose, dtype: int64

In [17]:
# an iloc example
diabetes_df.iloc[0:4,0:2]

Unnamed: 0,PatientID,Pregnancies
0,1354778,0
1,1147438,8
2,1640031,7
3,1883350,9
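In the diabetes dataframe the row labels happen to be the integers 0, 1, 2, ..., so labels and positions look the same and the difference between loc and iloc is easy to miss. The sketch below uses a made-up `demo_df` with string row labels to make the difference visible.

```python
import pandas as pd

# Made-up frame with string row labels so that labels and positions differ
demo_df = pd.DataFrame(
    {"PlasmaGlucose": [171, 92, 115, 103], "Age": [21, 23, 23, 43]},
    index=["a", "b", "c", "d"],
)

# loc slices by label and includes the end label: rows a, b and c
by_label = demo_df.loc["a":"c", "PlasmaGlucose"]

# iloc slices by position and excludes the end position: rows 0 and 1 only
by_position = demo_df.iloc[0:2, 0]

print(by_label.tolist())
print(by_position.tolist())
```

This is why `loc[0:2, ...]` returned three rows earlier while `iloc[0:4, ...]` returned four.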


The code below categorises the patients by age into old and young. Any patient older than 50 is labelled old, and everyone else (aged 50 or younger) is labelled young.

In [18]:
for row in range(len(diabetes_df)):
    if diabetes_df.loc[row,"Age"] > 50:
        diabetes_df.loc[row,"Status"] = "Old"
    else:
        diabetes_df.loc[row,"Status"] = "young"
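Looping over rows one at a time works, but it becomes slow as the dataframe grows. As an alternative sketch (not the method used in these notes), the same two-way split can be written in one vectorised statement with `np.where`. `ages_df` is a made-up dataframe used only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages (hypothetical values)
ages_df = pd.DataFrame({"Age": [21, 43, 50, 51, 77]})

# np.where(condition, value_if_true, value_if_false) is applied to the whole column at once
ages_df["Status"] = np.where(ages_df["Age"] > 50, "Old", "young")

print(ages_df)
```

The condition `ages_df["Age"] > 50` is evaluated for every row in one go, so no explicit loop is needed.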

In [19]:
diabetes_df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Status
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0,young
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0,young
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0,young
3,1883350,9,103,78,25,304,29.582192,1.28287,43,1,young
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0,young


In [20]:
diabetes_df.tail()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Status
14995,1490300,10,65,60,46,177,33.512468,0.148327,41,1,young
14996,1744410,2,73,66,27,168,30.132636,0.862252,38,1,young
14997,1742742,0,93,89,43,57,18.690683,0.427049,24,0,young
14998,1099353,0,132,98,18,161,19.791645,0.302257,23,0,young
14999,1386396,3,114,65,47,512,36.215437,0.147363,34,1,young


In [21]:
"""
adding more categories to the status column
"""
for row in range(len(diabetes_df)):
    age = diabetes_df.loc[row,"Age"]
    # check the narrowest band first, otherwise the "Young" branch can never be reached
    if age < 20:
        diabetes_df.loc[row,"Status"] = "Young"
    elif age < 50:
        diabetes_df.loc[row,"Status"] = "Youth"
    else:
        diabetes_df.loc[row,"Status"] = "Old"
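When there are more than two age bands, `pd.cut` is one way to assign them without a loop. The sketch below assumes the three bands used in this section (under 20, 20 to 49, and 50 and over) and a made-up `cut_df`; it is an alternative illustration rather than the approach taken in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on the band edges (hypothetical values)
cut_df = pd.DataFrame({"Age": [15, 20, 49, 50, 77]})

# right=False makes each band include its lower edge and exclude its upper edge:
# [-inf, 20) -> Young, [20, 50) -> Youth, [50, inf) -> Old
cut_df["Status"] = pd.cut(
    cut_df["Age"],
    bins=[-np.inf, 20, 50, np.inf],
    labels=["Young", "Youth", "Old"],
    right=False,
)

print(cut_df)
```

Because the bands are defined by their edges, they cannot overlap, which avoids the kind of ordering mistakes that are easy to make in a chain of `if`/`elif` checks.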

#### Segmenting the data

With a large dataset, it is sometimes better to break it down into smaller datasets for closer investigation. You can break a dataset down in multiple ways; for this diabetes dataset, we will split it into new dataframes based on the Status category.

In [22]:
# creating 3 new dataframes based on the status category
old_df = diabetes_df[diabetes_df["Status"].str.lower()=="old"]
young_df = diabetes_df[diabetes_df["Status"].str.lower()=="young"]
youth_df = diabetes_df[diabetes_df["Status"].str.lower()=="youth"]


In [23]:
print(f"The shape of the data for old category is {old_df.shape}")
print(f"The shape of the data for youth category is {youth_df.shape}")
print(f"The shape of the data for young category is {young_df.shape}")

The shape of the data for old category is (1446, 11)
The shape of the data for youth category is (13554, 11)
The shape of the data for young category is (0, 11)
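If you only need the size of each segment, or want every segment without writing one filter per category, `value_counts` and `groupby` are worth knowing. The following is a sketch on a made-up `status_df`; the column names mirror the diabetes data but the values are invented.

```python
import pandas as pd

# Made-up data with a Status column (hypothetical values)
status_df = pd.DataFrame({
    "Age": [21, 23, 55, 43, 61],
    "Diabetic": [0, 0, 1, 1, 1],
    "Status": ["Youth", "Youth", "Old", "Youth", "Old"],
})

# number of rows in each category
counts = status_df["Status"].value_counts()

# one dataframe per category, stored in a dictionary keyed by the category name
segments = {name: group for name, group in status_df.groupby("Status")}

print(counts)
print(segments["Old"].shape)
```

A category with no rows simply does not appear in `segments`, which is a quick way to spot an empty segment.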


#### Other checks you can perform on your data to ensure it is good

In [24]:
# check for any duplicates in your data as this is common

diabetes_df.duplicated().sum()

0

In [25]:
# dropping any duplicates you found so they do not skew your data; drop_duplicates() returns a new dataframe, so assign the result if you want to keep it
diabetes_df.drop_duplicates()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic,Status
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0,Youth
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0,Youth
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0,Youth
3,1883350,9,103,78,25,304,29.582192,1.282870,43,1,Youth
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0,Youth
...,...,...,...,...,...,...,...,...,...,...,...
14995,1490300,10,65,60,46,177,33.512468,0.148327,41,1,Youth
14996,1744410,2,73,66,27,168,30.132636,0.862252,38,1,Youth
14997,1742742,0,93,89,43,57,18.690683,0.427049,24,0,Youth
14998,1099353,0,132,98,18,161,19.791645,0.302257,23,0,Youth
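One detail that is easy to miss: `drop_duplicates()` returns a new dataframe and leaves the original unchanged unless you assign the result. The sketch below shows this on a made-up `dupes_df` that actually contains a duplicate row.

```python
import pandas as pd

# Made-up data where the second and third rows are identical
dupes_df = pd.DataFrame({"PatientID": [1, 2, 2, 3], "Age": [21, 23, 23, 43]})

# the result is not assigned, so dupes_df itself is unchanged
dupes_df.drop_duplicates()
n_before = len(dupes_df)

# assigning the result keeps the change; reset_index tidies the row labels
dupes_df = dupes_df.drop_duplicates().reset_index(drop=True)
n_after = len(dupes_df)

print(n_before, n_after)
```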


In [26]:
# checking for unique values in a column in the dataset 
diabetes_df["Status"].unique()

array(['Youth', 'Old'], dtype=object)

In [28]:
# to get a basic quick summary of your data, you can use the describe function to view this
diabetes_df.describe()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
count,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0
mean,1502922.0,3.224533,107.856867,71.220667,28.814,137.852133,31.509646,0.398968,30.137733,0.333333
std,289253.4,3.39102,31.981975,16.758716,14.555716,133.068252,9.759,0.377944,12.089703,0.47142
min,1000038.0,0.0,44.0,24.0,7.0,14.0,18.200512,0.078044,21.0,0.0
25%,1252866.0,0.0,84.0,58.0,15.0,39.0,21.259887,0.137743,22.0,0.0
50%,1505508.0,2.0,104.0,72.0,31.0,83.0,31.76794,0.200297,24.0,0.0
75%,1755205.0,6.0,129.0,85.0,41.0,195.0,39.259692,0.616285,35.0,1.0
max,1999997.0,14.0,192.0,117.0,93.0,799.0,56.034628,2.301594,77.0,1.0


With the describe function above, you can quickly look at summary statistics for each column in your dataset and see whether anything looks off. For example, if you see that the minimum age is 5 and the maximum age is 100, you may decide to restrict your data to a certain age range and treat the extreme ages as anomalies.
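Restricting the data to an age range, as described above, can be sketched with `Series.between`. The dataframe `people_df` and the 18 to 90 range are assumptions picked purely for illustration.

```python
import pandas as pd

# Made-up ages including the extreme values from the example (5 and 100)
people_df = pd.DataFrame({"Age": [5, 21, 35, 77, 100]})

# between() includes both ends of the range by default
in_range_df = people_df[people_df["Age"].between(18, 90)]

print(in_range_df)
```

Filtering like this creates a new dataframe, so the rows outside the range are still available in `people_df` if you want to inspect them separately.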

#### Datetime

In my experience, a dataset with a date column opens up a lot of analysis, such as time series analysis. A good habit is to break the date down into month, year (if there are different years) and weekday. Bear in mind that after breaking it down you may find you do not need one of the columns because every value is the same (for example the year); that is fine, as the breakdown is still good practice.

To look at how to work with datetime columns and split them, the daily-bike-share data is examined below.

In [30]:
bikeinfo_df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,rentals
0,1,1/1/2011,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331
1,2,1/2/2011,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131
2,3,1/3/2011,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120
3,4,1/4/2011,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108
4,5,1/5/2011,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82


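By looking at the top rows of the data, we can see that there is a date column (dteday), but there are also yr, mnth and weekday columns, so the date has already been broken down for us.

For that reason, a new dataset is used below, which contains information about car rental demand.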

The link to the dataset is https://www.kaggle.com/datasets/ashwinshetgaonkar/analytics-vidhya-hackathon-april-2022 

In [33]:
# reading the data into a dataframe and showing the top rows
car_rental_df = pd.read_csv("Datasets/car_rental.csv")

car_rental_df.head()

Unnamed: 0,date,hour,demand
0,2021-03-01,0,0
1,2021-03-01,1,0
2,2021-03-01,2,0
3,2021-03-01,3,0
4,2021-03-01,5,0


In [34]:
# checking that the datatypes are as expected
car_rental_df.dtypes

date      object
hour       int64
demand     int64
dtype: object

From the above check, we can see that the date column is stored as a string (object) and not as datetime. It is better to convert it to a datetime column for future analysis.

In [35]:
# converting a column to a datetime column
car_rental_df["date"] = pd.to_datetime(car_rental_df["date"])

In [36]:
# confirming that conversion was done
car_rental_df.dtypes

date      datetime64[ns]
hour               int64
demand             int64
dtype: object
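The dates in the car rental data are written as year-month-day, which pandas recognises without help. Dates written like `1/1/2011` (as in the dteday column of the bike data) are more ambiguous, so it can help to state the format explicitly. The sketch below assumes a month/day/year layout and uses a made-up series that includes one invalid entry to show what `errors="coerce"` does.

```python
import pandas as pd

# Made-up date strings; the last one is deliberately invalid
dates = pd.Series(["1/1/2011", "1/2/2011", "not a date"])

# format states the expected layout; errors="coerce" turns unparseable values into NaT
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

print(parsed)
print(parsed.isna().sum())
```

Counting the NaT values afterwards is a quick check on how many rows failed to convert.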

The next step is to create 3 new columns containing the month, the year and the day of the week. This lets us group the data as we see fit and analyse it to find any patterns.

In [41]:
#creating new columns and filling them respectively by extracting the information from the date column
car_rental_df["month"] = car_rental_df["date"].dt.month 
car_rental_df["year"] = car_rental_df["date"].dt.year   
car_rental_df["weekday"] = car_rental_df["date"].dt.day_name()

In [42]:
# confirming the creation of the new columns
car_rental_df.head()

Unnamed: 0,date,hour,demand,month,year,weekday
0,2021-03-01,0,0,3,2021,Monday
1,2021-03-01,1,0,3,2021,Monday
2,2021-03-01,2,0,3,2021,Monday
3,2021-03-01,3,0,3,2021,Monday
4,2021-03-01,5,0,3,2021,Monday


With this you can see how new columns are created by extracting information from another column, which in this case is the date column.
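Once the weekday column exists, grouping by it is one way to start looking for patterns. This is a sketch on a made-up `rentals_df` with invented demand figures, not results from the car rental dataset.

```python
import pandas as pd

# Made-up rental records (hypothetical demand values)
rentals_df = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-03-01", "2021-03-02", "2021-03-08"]),
    "demand": [10, 20, 30, 40],
})

# extract the weekday name from the date column
rentals_df["weekday"] = rentals_df["date"].dt.day_name()

# average demand for each day of the week
mean_by_weekday = rentals_df.groupby("weekday")["demand"].mean()

print(mean_by_weekday)
```

The same pattern works with the month or year columns, or with several columns at once by passing a list to `groupby`.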

#### THE END