Exploratory Data Analysis Project using Python


Exploratory Data Analysis (EDA) is an important step in any data analysis. It involves investigating a given dataset to check for anomalies, discover patterns and build a critical understanding of the data.

It involves numerical summaries and graphical presentation of the dataset. It helps a data scientist make sense of the data before testing hypotheses and drawing insights in a project.

In this article I will try to explain the concepts and steps I undertook to perform a simple EDA on the Safety of Transport dataset, which was collected to document security pain points and the strategies adopted by different cities around the world to ensure inclusive development in public transit systems.


Importing Libraries


To start with, I imported all the necessary libraries for the assignment. The libraries I used are pandas, NumPy, Matplotlib and seaborn.

Pandas is a library for data manipulation. It is an open source library which provides high-performance, easy-to-use data structures for Python programming.

NumPy is a library that comes in when working with arrays. It provides multidimensional arrays and matrices for large datasets, along with many mathematical functions that operate on these arrays.

Matplotlib is a library for plotting and visualization work. Matplotlib and NumPy work hand in hand. Together they offer an alternative to MATLAB.

Seaborn is another library for data visualization and is built on top of Matplotlib.

The os library is used for interacting with your machine's operating system. Here it gives the current working directory, so that file paths can be built relative to it in case you move the project to a different location on your drive. This module is optional and I just have it there for my own convenience.



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os
pwd=os.getcwd()
%matplotlib inline 



Here is an outline of what we are going to cover in our script.
* Getting/Importing the data
* Data Preparation and Cleaning
* Exploratory analysis and Visualizations.
* Summary  

Reading Data & Checking

Here we use the pandas library to import the required dataset into Python. You can read many data formats using pandas (CSV, HTML, JSON, etc.).

In [None]:
data = pd.read_csv('G:/Transport Safety Project/Public_Transit_Survey.csv')
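
The read step can be tried without the survey file. The sketch below is only an illustration: the CSV text and its column names are made-up stand-ins, not the real contents of Public_Transit_Survey.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for the survey file; the real columns may differ.
csv_text = """gender,frequency,harassment
female,daily,yes
male,weekly,no
female,oftenly,yes
"""

# read_csv accepts a path or any file-like object, so StringIO works here.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)           # (number of rows, number of columns)
print(list(sample.columns))   # column names taken from the header line
```

Reading from a string like this is handy for experimenting before pointing read_csv at the real path.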

To see what the data looks like, we use the .head() method to view the first five rows of the dataset and .tail() to view the last five rows.



In [None]:
data.head()


In [None]:
data.tail()
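
Both methods also take the number of rows to show. A small sketch on a made-up ten-row frame (the name toy and the respondent column are hypothetical, used only for illustration):

```python
import pandas as pd

# Hypothetical toy frame with ten rows.
toy = pd.DataFrame({"respondent": range(1, 11)})

first_three = toy.head(3)   # first 3 rows instead of the default 5
last_two = toy.tail(2)      # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```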


To see the whole dataset, simply type the name given to the DataFrame, in this case data.

In [None]:
data

Checking the data type

It is important to check the data type of each column. The dtypes attribute returns a Series with the data type of each column.

In [None]:
data.dtypes
 

The info() method does the same thing as dtypes. The only difference is that data.info() prints additional information (non-null counts, memory usage).

In [None]:
data.info()
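
To compare the two side by side, here is a minimal sketch on a made-up frame (the columns age, gender and fare are assumptions for illustration, not columns of the survey):

```python
import io
import pandas as pd

# Hypothetical toy data with one missing value in two of the columns.
toy = pd.DataFrame({
    "age": [23, 35, 41],
    "gender": ["female", "male", None],
    "fare": [1.5, 2.0, None],
})

types = toy.dtypes          # Series: one dtype per column
print(types)

# info() prints to stdout by default; buf= captures the report as text.
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()
print(report)
```

The captured report lists the non-null count per column, which dtypes alone does not show.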

Checking the variable/column names in our data

This is how we get the column names of a DataFrame. It is important later on when doing certain operations, especially with a huge dataset.
There are several ways of doing this:
* sorted() function
* tolist() method
* columns.values attribute
* keys() method
* columns attribute of the DataFrame object
* Iterating over columns

For this project we shall use two of them as examples: the columns attribute of the DataFrame object and iterating over the columns.


In [None]:
#columns attribute with dataframe object
print(data.columns)


In [None]:
#Iterating over columns method
for col in data.columns:
    print(col)
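
The other options in the list can be sketched the same way. The frame below is a made-up, empty example with three hypothetical column names:

```python
import pandas as pd

# Hypothetical empty frame; only the column labels matter here.
toy = pd.DataFrame(columns=["gender", "frequency", "age"])

as_list = toy.columns.tolist()    # plain Python list, original order
as_array = toy.columns.values     # NumPy array of the labels
as_keys = list(toy.keys())        # keys() returns the same Index as .columns
alphabetical = sorted(toy)        # iterating a DataFrame yields its column names

print(as_list)
print(alphabetical)
```

All of these return the same labels; they differ only in the container type and, for sorted(), the order.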

Checking for missing Data

Missing data occurs when no information is provided for one or more items in a record. Data goes missing either because the value never existed or because it was not collected.
In pandas, missing data is represented by two values:
* NaN
* None
Pandas treats NaN and None interchangeably as null or missing values.



There are multiple functions for detecting, replacing and removing null values in a DataFrame:
* notnull()
* dropna()
* replace()
* isnull()
* interpolate()
* fillna()
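
A few of these can be sketched on a small made-up frame. The columns fare and gender are hypothetical and only serve to show the behaviour:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mixing NaN and None.
toy = pd.DataFrame({
    "fare": [1.0, np.nan, 3.0, None],   # None is stored as NaN in a float column
    "gender": ["female", None, "male", "female"],
})

null_mask = toy.isnull()        # True where a value is missing
present_mask = toy.notnull()    # the exact opposite of isnull()

complete_rows = toy.dropna()    # keep only rows with no missing values
interpolated = toy["fare"].interpolate()   # linear fill between known numbers

print(null_mask)
print(complete_rows)
print(interpolated)
```

Note that interpolate() only makes sense for numeric columns, whereas dropna() and fillna() work on any type.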



In [None]:
data.isnull().tail() # .isnull() checks for null values. .tail() brings the last five rows.


The .isnull() method checks each value in a DataFrame and returns True where it is null.

In [None]:
missing_values_count = data.isnull().sum()

You can count the missing values in each column by chaining the .sum() method after .isnull().

In [None]:
missing_values_count[0:20]
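
As a rough illustration of what these counts look like, here is a sketch on made-up data (the column names and values are hypothetical). It also shows the share of missing values, which is often easier to judge than a raw count:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the survey.
toy = pd.DataFrame({
    "gender": ["female", None, "male", "female"],
    "age": [23, np.nan, np.nan, 41],
    "city": ["Nairobi", "Lagos", "Quito", "Delhi"],
})

missing_per_column = toy.isnull().sum()          # count of NaN/None per column
missing_share = toy.isnull().mean() * 100        # the same as a percentage
total_missing = int(missing_per_column.sum())    # one number for the whole frame

print(missing_per_column)
print(missing_share)
print(total_missing)
```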

Dropping Columns

When performing EDA it is important to drop unnecessary columns, those that will not contribute much to our analysis. To know which columns to drop you need to understand the data very well, so make sure you have a clear understanding of the data before dropping anything.

To drop columns we use the .drop() method, which removes columns or rows by specifying their exact names and the corresponding axis (axis=1 for columns, axis=0 for rows).

In [None]:
data= data.drop(['group1-note', 'group1-note2', 'location-Accuracy', 'meta-instanceID', 'KEY', 'SubmitterID','SubmitterName', 'AttachmentsPresent', 'AttachmentsExpected', 'Status','ReviewState', 'DeviceID', 'Edits'], axis=1)


In [None]:
data
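
A minimal sketch of the same idea on a made-up frame. The frame toy and the label NotAColumn are hypothetical; KEY and DeviceID are borrowed from the list above only as names:

```python
import pandas as pd

# Hypothetical toy data with two bookkeeping columns.
toy = pd.DataFrame({
    "gender": ["female", "male"],
    "KEY": ["uuid:1", "uuid:2"],
    "DeviceID": ["a", "b"],
})

# columns= is equivalent to passing the labels together with axis=1.
trimmed = toy.drop(columns=["KEY", "DeviceID"])

# errors="ignore" skips labels that are not present instead of raising KeyError.
safe = toy.drop(columns=["KEY", "NotAColumn"], errors="ignore")

print(list(trimmed.columns))
print(list(safe.columns))
```

drop() returns a new DataFrame and leaves the original untouched, which is why the result is assigned back to data in the cell above.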

Data Cleaning/dealing with NA data

The next step is dealing with the NA values, that is, the values that are missing. Missing values interfere with the analysis, so we have to decide what to do with them. They can either be removed or replaced, depending on the data. It is important to ask the data owners about missing values before you do anything about them.

The fillna() method is used to replace missing values in a DataFrame. Here each column's missing values are filled with that column's most frequent value (its mode).

In [None]:
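for column in data.columns:
    # Fill gaps with the column's most frequent value; assigning the result back
    # avoids the chained inplace update, which newer pandas versions warn about.
    data[column] = data[column].fillna(data[column].mode()[0])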

In [None]:
data

In [None]:
#data.fillna(method = 'bfill', axis=0).fillna(0)
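
One caveat: mode() returns an empty result for a column that is entirely missing, so mode()[0] would fail there. The sketch below guards against that on made-up data (the frame and the notes column are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "notes" has no values at all.
toy = pd.DataFrame({
    "gender": ["female", None, "female", "male"],
    "age": [30, 30, np.nan, 41],
    "notes": [None, None, None, None],
})

filled = toy.copy()
for column in filled.columns:
    modes = filled[column].mode()     # mode() ignores missing values
    if not modes.empty:               # skip columns that are entirely missing
        filled[column] = filled[column].fillna(modes[0])

print(filled)
```

Filling with the mode suits categorical answers; for numeric columns the median or mean may be a better choice, depending on the data.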

Change value names

Part of data preparation is editing values in a table that seem to be incorrectly spelt or written. In my data, the value "oftenly" in the frequency column had in some rows been entered as "only". For the data to give us clear counts we needed to correct that.

.loc is a label-based indexer (used with square brackets rather than called like a function) that can update values by providing the row labels or a row condition together with the column label.

In [None]:
data.loc[data['frequency']== "only", "frequency"]= "oftenly"

In [None]:
#data
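
The same update can be sketched step by step on a made-up column. The values below are hypothetical; replace() is shown as an alternative I did not use in the project:

```python
import pandas as pd

# Hypothetical toy column containing the misspelt value twice.
toy = pd.DataFrame({"frequency": ["daily", "only", "oftenly", "only"]})

is_typo = toy["frequency"] == "only"         # Boolean mask selecting the rows
toy.loc[is_typo, "frequency"] = "oftenly"    # assign into that column only

# replace() is an alternative when a whole column should be mapped.
also_fixed = pd.Series(["daily", "only"]).replace({"only": "oftenly"})

print(toy)
print(also_fixed)
```

Building the mask first makes it easy to check how many rows will change (is_typo.sum()) before overwriting anything.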

Counts/Groupby

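As the name suggests, the groupby() function groups rows, and it involves several operations at once: splitting the object, applying a function, and combining the results.

It returns a GroupBy object which contains information about the groups. The value_counts() method then counts how often each value occurs; with normalize=True it returns proportions instead of counts.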

In [None]:
count= data['gender'].value_counts()
print(count)

In [None]:
count= data['frequency'].value_counts(normalize=True)
print(count)

In [None]:
Double_Counts = data.groupby('gender')['frequency'].value_counts(normalize= False)
Double_Counts

In [None]:
Double_Counts = data.groupby('forms')['harassment'].value_counts(normalize= True)
Double_Counts

In [None]:
Double_Counts = data.groupby('gender')['gender_safety'].value_counts(normalize= True)
Double_Counts

In [None]:
#data.value_counts()
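
To make the split, apply and combine steps visible, here is a small sketch on made-up data (the rows are hypothetical; only the column names gender and frequency come from the survey):

```python
import pandas as pd

# Hypothetical toy data: three female and two male respondents.
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "frequency": ["daily", "daily", "weekly", "daily", "weekly"],
})

grouped = toy.groupby("gender")    # split: one group per gender
sizes = grouped.size()             # apply + combine: number of rows per group

# Proportions are computed within each gender group, so each group sums to 1.
shares = grouped["frequency"].value_counts(normalize=True)

# unstack() pivots the inner index level into columns for easier reading.
table = shares.unstack(fill_value=0)

print(sizes)
print(table)
```

The unstacked table has one row per gender and one column per frequency value, which is usually easier to read than the two-level index.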

Convert the Dataframe to CSV

This last piece of code saves the data as CSV. When you want to take the clean data to other tools, for example for visualization, you can convert the DataFrame to CSV format using the .to_csv() method. I have saved it on drive G in a folder named Transport Safety Project.

In [None]:
data.to_csv( r'G:\Transport Safety Project\Clean Data.csv', index=False)



In [None]:
#data.to_excel(os.path.join(pwd, "final_output.xlsx"), index=False)
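
To see what index=False changes, the sketch below writes a made-up frame to CSV text instead of a file (the frame clean is hypothetical):

```python
import io
import pandas as pd

# Hypothetical toy data standing in for the cleaned survey.
clean = pd.DataFrame({"gender": ["female", "male"], "age": [23, 41]})

# With no path, to_csv returns the CSV text instead of writing a file.
with_index = clean.to_csv()
without_index = clean.to_csv(index=False)

# Reading the text back shows that index=False round-trips cleanly.
restored = pd.read_csv(io.StringIO(without_index))

print(with_index)
print(without_index)
```

Without index=False the row index is written as an extra unnamed first column, which then shows up as a stray column when the file is read back.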