Title: 2.2 Exercises

Author: Chad Wood

Date: 3 Mar 2022

Modified By: Chad Wood

Description: This program demonstrates working knowledge of the cleaning techniques necessary for exploratory data analysis. Please bear in mind that the generated profile report is not displayed by default in saved notebooks. I have placed the dataset heads toward the bottom of this file so that compliance with the instructions can be observed.

##### Instructions
Perform EDA on the eda_data.csv file including descriptive analytics and some basic data cleaning.

For the data cleansing, make each column a number - noting that things like "(" and "%" are not part of a number and in fact change a number to a string. You may have to use Python to cast the object to float.

Use Pandas Profiling to create a profiling report on the eda_data_small.csv file (if you run Pandas Profiling on the full .csv file, it will take an excessive amount of time to run).

##### Dependencies and Data

In [1]:
import pandas as pd
from pandas_profiling import ProfileReport

In [2]:
eda_data = pd.read_csv('data/eda_data.csv')
eda_data_small = pd.read_csv('data/eda_data_small.csv')
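
The `Unnamed: 0` column in the dataset heads suggests that the CSV files store a saved row index as their first column. A minimal sketch, assuming that is the case, of how `index_col=0` would read that column as the index instead. The inline CSV text is a hypothetical stand-in, not the contents of the real files:

```python
import io
import pandas as pd

# Stand-in CSV text (hypothetical); the empty first header mimics a saved index column
csv_text = ",x0,x1\n0,-17.93,6.56\n1,-37.21,10.77\n"

# Default read: the unnamed first column becomes a data column called 'Unnamed: 0'
default_read = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```

Whether to drop the column this way depends on whether the stored index carries meaning for the analysis.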

##### Operation

In [3]:
import re

# Function performs all cleaning operations (returns floats, despite its name)
def str_to_int(series):
    
    # Replaces NaN with the string '0' so that every value is a string; converted to float below
    series = series.fillna('0')
    
    # Removes dollar signs and commas, which prevent conversion to float
    series = series.replace('[$,]', '', regex=True)
    
    # Converts to float; parenthesized values become negative, percentages become fractions
    def conversion(x):
        if '(' in x:
            return -float(re.sub('[()]', '', x))
        if '%' in x:
            return float(x.strip('%')) / 100
        return float(x)
    
    return series.apply(conversion)
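
The same cleaning rules can also be expressed with pandas string methods rather than a per-value function. The sketch below is one possible vectorized variant, not the approach used in this notebook. The name `clean_numeric` and the sample values are hypothetical, and unlike the function above it also divides by 100 when a value carries both parentheses and a percent sign:

```python
import pandas as pd

def clean_numeric(series: pd.Series) -> pd.Series:
    """Converts strings such as '$1,213.37', '($24.86)' or '0.01%' to floats."""
    # Missing values become '0', matching the behavior of the original function
    text = series.fillna('0').astype(str).str.strip()

    # Records which values are parenthesized (negative) or percentages
    is_negative = text.str.contains('(', regex=False)
    is_percent = text.str.contains('%', regex=False)

    # Strips every non-numeric marker, then parses what is left
    values = pd.to_numeric(text.str.replace(r'[$,()%]', '', regex=True))

    # Applies the percentage scaling and the sign flip where flagged
    values = values.where(~is_percent, values / 100)
    values = values.where(~is_negative, -values)
    return values

# Hypothetical sample covering each case
sample = pd.Series(['$1,213.37', '($24.86)', '0.01%', None])
cleaned = clean_numeric(sample)
print(cleaned)
```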

In [4]:
dirty_cols = ['x6', 'x10']

# Performs the cleaning operation on dirty columns
for col in dirty_cols:
    eda_data[col] = str_to_int(eda_data[col])
    eda_data_small[col] = str_to_int(eda_data_small[col])
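
The instructions also ask for descriptive analytics. A minimal sketch of what that could look like once the columns are numeric, using a small stand-in frame built from the `x6` and `x10` values shown in the deliverables. On the real data the same calls would be made on `eda_data`:

```python
import pandas as pd

# Stand-in frame built from the first rows of x6 and x10 shown in the deliverables
sample = pd.DataFrame({
    'x6': [-1306.52, -24.86, -110.85, -324.43, 1213.37],
    'x10': [-0.0001, 0.0, 0.0, 0.0001, -0.0001],
})

# Confirms that both columns are numeric after cleaning
print(sample.dtypes)

# Count, mean, standard deviation, min, quartiles, and max per column
summary = sample.describe()
print(summary)

# Counts exact zeros, which would include any NaN values that were filled with '0'
zero_counts = (sample == 0).sum()
print(zero_counts)
```

Because the cleaning step fills missing values with zero, the zero counts are worth checking before reading too much into the means.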

##### Deliverables

In [7]:
print('Review columns x6 and x10')
eda_data.head()

Review columns x6 and x10


Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,y
0,-17.933519,6.55922,-14.45281,-4.732855,0.381673,2.563194,-1306.52,-89.394348,-28.454044,-16.201298,-0.0001,0.21701,9.729891,-0.786431,0.666146
1,-37.214754,10.77493,-15.384004,-0.077339,10.983774,-15.210206,-24.86,153.032652,-32.557736,69.675903,0.0,-3.584908,35.727926,-0.985552,0.378411
2,0.330441,-19.609972,-9.167911,2.064124,12.071688,12.506141,-110.85,-141.437276,-20.794952,55.042604,0.0,-3.991366,-9.283523,-3.394718,0.624498
3,-13.709765,-8.01139,6.759264,1.727615,-1.768382,24.039733,-324.43,51.039653,-7.046908,-31.424419,0.0001,7.908897,-2.891882,-2.690222,0.126622
4,-4.202598,7.07621,-26.004919,-4.269696,-3.414224,2.115989,1213.37,-31.0467,19.061182,-31.525515,-0.0001,0.846719,25.49748,3.516801,0.640025


In [8]:
print('Review columns x6 and x10')
eda_data_small.head()

Review columns x6 and x10


Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,y
0,-17.933519,6.55922,-14.45281,-4.732855,0.381673,2.563194,-1306.52,-89.394348,-28.454044,-16.201298,-0.0001,0.21701,9.729891,-0.786431,0.666146
1,-37.214754,10.77493,-15.384004,-0.077339,10.983774,-15.210206,-24.86,153.032652,-32.557736,69.675903,0.0,-3.584908,35.727926,-0.985552,0.378411
2,0.330441,-19.609972,-9.167911,2.064124,12.071688,12.506141,-110.85,-141.437276,-20.794952,55.042604,0.0,-3.991366,-9.283523,-3.394718,0.624498
3,-13.709765,-8.01139,6.759264,1.727615,-1.768382,24.039733,-324.43,51.039653,-7.046908,-31.424419,0.0001,7.908897,-2.891882,-2.690222,0.126622
4,-4.202598,7.07621,-26.004919,-4.269696,-3.414224,2.115989,1213.37,-31.0467,19.061182,-31.525515,-0.0001,0.846719,25.49748,3.516801,0.640025


In [2]:
# The rendered report is only visible in Jupyter Notebook if it was generated during a live session,
# so it does not appear in saved notebooks. Re-run this cell to generate the profile again.

ProfileReport(eda_data_small)
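
Since the rendered report does not survive in a saved notebook, one option is to write it to a standalone HTML file. A minimal sketch, assuming `pandas_profiling` is installed. The helper name `save_profile`, the report title, and the default file name are hypothetical:

```python
import pandas as pd

def save_profile(df: pd.DataFrame, path: str = 'eda_data_small_profile.html') -> str:
    """Writes a profiling report for df to an HTML file and returns the path."""
    # Imported here so that the helper can be defined even where the package is not installed
    from pandas_profiling import ProfileReport

    # minimal=True turns off the most expensive computations
    report = ProfileReport(df, title='eda_data_small profile', minimal=True)
    report.to_file(path)
    return path
```

In this notebook it would be called as `save_profile(eda_data_small)`, and the resulting file can be opened in any browser.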


