<img align="right" style="padding-left:50px;" src="figures_wk4/data_cleaning.png" width=350><br>
### User Bias in Data Cleaning
For your homework assignment this week, we will explore how our treatment of our data can impact the quality of our results.

**Dataset:**
The data is a Salary Survey from AskAManager.org. It’s US-centric-ish but does allow for a range of country inputs.

A list of the corresponding survey questions can be found [here](https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html).

 

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
df= pd.read_csv('survey_data.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28108 entries, 0 to 28107
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   timestamp  28108 non-null  object 
 1   q1         28108 non-null  object 
 2   q2         28033 non-null  object 
 3   q3         28107 non-null  object 
 4   q4         7273 non-null   object 
 5   q5         28108 non-null  object 
 6   q6         20793 non-null  float64
 7   q7         28108 non-null  object 
 8   q8         211 non-null    object 
 9   q9         3047 non-null   object 
 10  q10        28108 non-null  object 
 11  q11        23074 non-null  object 
 12  q12        28026 non-null  object 
 13  q13        28108 non-null  object 
 14  q14        28108 non-null  object 
 15  q15        27885 non-null  object 
 16  q16        27937 non-null  object 
 17  q17        27931 non-null  object 
dtypes: float64(1), object(17)
memory usage: 3.9+ MB


In [4]:
df.head(10)

Unnamed: 0,timestamp,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13,q14,q15,q16,q17
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White
5,4/27/2021 11:02:46,25-34,Education (Higher Education),Scholarly Publishing Librarian,,62000,,USD,,,USA,New Hampshire,Hanover,8 - 10 years,2 - 4 years,Master's degree,Man,White
6,4/27/2021 11:02:51,25-34,Publishing,Publishing Assistant,,33000,2000.0,USD,,,USA,South Carolina,Columbia,2 - 4 years,2 - 4 years,College degree,Woman,White
7,4/27/2021 11:03:00,25-34,Education (Primary/Secondary),Librarian,"High school, FT",50000,,USD,,,United States,Arizona,Yuma,5-7 years,5-7 years,Master's degree,Man,White
8,4/27/2021 11:03:01,45-54,Computing or Tech,Systems Analyst,Data developer/ETL Developer,112000,10000.0,USD,,,US,Missouri,St. Louis,21 - 30 years,21 - 30 years,College degree,Woman,White
9,4/27/2021 11:03:02,35-44,"Accounting, Banking & Finance",Senior Accountant,,45000,0.0,USD,,I work for a Charter School,United States,Florida,Palm Coast,21 - 30 years,21 - 30 years,College degree,Woman,"Hispanic, Latino, or Spanish origin, White"


### Assignment
Your goal for this assignment is to observe how your data treatment during the cleaning process can skew or bias the dataset.

Before diving right in, stop and read through the questions associated with the dataset. As you can see, they are either free-form text entries or categorical selections. Knowing this, perform some exploratory data analysis (EDA) to investigate the "state" of the dataset.

[Add as many code cells below here as needed]


In [5]:
df.shape

(28108, 18)

It can be seen that there are 28108 rows and 18 columns.

In [6]:
pd.set_option('display.float_format', '{:.2f}'.format)
df.describe()

Unnamed: 0,q6
count,20793.0
mean,18244.6
std,833624.9
min,0.0
25%,0.0
50%,2000.0
75%,10000.0
max,120000000.0


`df.describe()` summarizes only q6, so q6 is the only column that pandas currently reads as numeric. Every other column, including the salary column q5, is stored as `object`.

In [7]:
median = df['q6'].median()
print(f"Median of 'q6': {median}")

Median of 'q6': 2000.0
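
The gap between the mean and the median of q6 suggests strong right skew. A minimal sketch of how that could be quantified, using Tukey's 1.5 × IQR rule on made-up values that only mimic the shape of q6 (the series `bonus` and its contents are assumptions, not survey data):

```python
import pandas as pd

# Made-up bonus values with one extreme entry, mimicking the shape of q6.
bonus = pd.Series([0, 0, 0, 2000, 2000, 5000, 10000, 120_000_000], dtype="float64")

# A mean far above the median is a sign of right skew.
print(f"mean={bonus.mean():.2f}, median={bonus.median():.2f}")

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers.
lower_q, upper_q = bonus.quantile([0.25, 0.75])
upper_fence = upper_q + 1.5 * (upper_q - lower_q)
outliers = bonus[bonus > upper_fence]
print(outliers.tolist())
```

Flagging a value this way does not mean it should be removed; it only marks rows worth inspecting before deciding how to treat them.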


In [8]:
df.describe(include='object')

Unnamed: 0,timestamp,q1,q2,q3,q4,q5,q7,q8,q9,q10,q11,q12,q13,q14,q15,q16,q17
count,28108,28108,28033,28107,7273,28108,28108,211,3047,28108,23074,28026,28108,28108,27885,27937,27931
unique,25326,7,1220,14377,7010,4319,11,124,2983,382,137,4841,8,8,6,5,51
top,4/27/2021 11:05:08,25-34,Computing or Tech,Software Engineer,Fundraising,60000,USD,INR,Hourly,United States,California,Boston,11 - 20 years,11 - 20 years,College degree,Woman,White
freq,5,12668,4711,286,20,430,23410,11,4,9004,2611,772,9630,6542,13536,21389,23235


In [9]:
missing  = df.isnull().sum()
print(missing)

timestamp        0
q1               0
q2              75
q3               1
q4           20835
q5               0
q6            7315
q7               0
q8           27897
q9           25061
q10              0
q11           5034
q12             82
q13              0
q14              0
q15            223
q16            171
q17            177
dtype: int64


In [10]:
missing_percentage  = (df.isnull().sum()/len(df)*100)
print(missing_percentage)

timestamp    0.00
q1           0.00
q2           0.27
q3           0.00
q4          74.12
q5           0.00
q6          26.02
q7           0.00
q8          99.25
q9          89.16
q10          0.00
q11         17.91
q12          0.29
q13          0.00
q14          0.00
q15          0.79
q16          0.61
q17          0.63
dtype: float64


Three columns are missing more than 70% of their values: q4, q8, and q9.
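
Rather than reading the mostly-empty columns off the printed table, they could be selected programmatically. A minimal sketch on a toy frame (the frame `toy`, its values, and the 70% cutoff `THRESHOLD` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are made up).
toy = pd.DataFrame({
    "q3": ["Librarian", "Analyst", "Manager", "Engineer"],
    "q4": [np.nan, np.nan, np.nan, "High school, FT"],
    "q6": [0.0, np.nan, 3000.0, 7000.0],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean() * 100

# Hypothetical cutoff: flag columns missing more than 70% of their values.
THRESHOLD = 70
mostly_missing = missing_pct[missing_pct > THRESHOLD].index.tolist()
print(mostly_missing)
```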

**Question:** How would you describe the "state" of this dataset? Be specific and detailed in your answer. (Think paragraphs rather than sentences).

#### ANSWER

#### This dataset has a total of 28108 rows and 18 columns. Only one of the 18 columns, q6, is a numerical feature; the rest are stored as text (object) and are mostly categorical or free-form. Three features are missing more than 70% of their values: q4, q8, and q9. In particular, q8 is missing almost all of its values. These columns should be dropped from the dataset, since little can be done with them: most of their values are missing, and they would only hamper the results of a trained model.

#### From `df.describe()` we can see that q6 ranges from 0 to 120000000. The mean of q6 is 18244.60 whereas the median is 2000, which is a huge difference. This shows that the column is highly right-skewed and that significant outliers are present. The q6 column is also missing about 26% of its values.

#### The "state" of the dataset can be described as "below average". Although the dataset holds a lot of information, much of it is flawed: there are many missing values and outliers, data consistency is poor, and the formatting of several features needs to be modified. For instance, the q5 column, which represents annual salaries, contains commas in its values, which makes numerical analysis hard. There is also no standard name for the United States of America in q10, which makes it difficult to group responses accurately. This dataset needs a lot of pre-processing work.


#### The Plan

Now, it is time to plan how you will clean up the dataset. You **are not** allowed to use any machine learning technique to clean the data. (No SMOTE! No machine learning! Or anything like that!)

**Question:** Based on your EDA above, detail how you would clean up this dataset. 
Things to consider: (This is not an exhaustive list)
- Are there columns that can't be effectively cleaned? If so, why?
- Are there columns that genuinely won't have a data value?
- Does it make sense to segment the dataset based on specific columns when determining how to handle the missing values?
- Are outliers a factor in this dataset?

Remember preserving as much of the data as possible is the goal. That means dropping rows with a missing value somewhere might not be the best idea.

### ANSWER

#### This dataset requires a lot of pre-processing in order to make it viable for model training or prediction. The columns q4, q8, and q9 are each missing more than 70% of their values, so I think there is no option other than to drop them. In detail, q4 is missing 74.12% of its data, q8 99.25%, and q9 89.16%. These features do not appear to hold significant or critical information, and they seem to be columns that genuinely have no value for most respondents, so it is best to drop them. The column q6 is missing about 26% of its values. It does not need to be dropped, because it is highly significant to the dataset; instead, its missing values will be imputed. Since q6 tracks monetary compensation and, as mentioned before, contains significant outliers that can skew the results, it will be imputed using the median. The huge difference between the mean and median of q6 shows skewness, and the median is less sensitive to outliers than the mean, so median imputation is appropriate. Outliers are indeed a factor in this dataset.

#### In addition, the inconsistent data in the dataset needs to be addressed through standardization. The column q5 can be converted into a numerical feature by removing the commas and symbols. The timestamp can be converted into a datetime format. The column q10 contains many different versions of the same country name, e.g. USA, United States, America, US. These can be standardized to a single name.
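
The assignment's question about segmenting the data before handling missing values can be illustrated with a small sketch. It imputes q6 with the median of each currency group (q7) instead of one global median, and keeps a flag for imputed rows. The toy values and the column name `q6_was_missing` are assumptions; this is one possible approach, not the method used in the implementation that follows.

```python
import numpy as np
import pandas as pd

# Toy data: bonus amounts (q6) reported in two currencies (q7); values are made up.
toy = pd.DataFrame({
    "q7": ["USD", "USD", "USD", "GBP", "GBP", "GBP"],
    "q6": [1000.0, np.nan, 3000.0, 200.0, 400.0, np.nan],
})

# Record which rows were imputed so the choice stays visible downstream.
toy["q6_was_missing"] = toy["q6"].isna()

# Median within each currency group, broadcast back to the original rows.
group_median = toy.groupby("q7")["q6"].transform("median")
toy["q6"] = toy["q6"].fillna(group_median)
print(toy)
```

Grouping by currency avoids filling a GBP row with a median dominated by USD amounts, at the cost of less stable medians in small groups.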

#### Implementation

Based on the plan that you described above, go ahead and clean up the dataset.

[Add as many code cells below here as needed]

In [11]:
cleaned_df = df.copy()

In [12]:
cleaned_df = cleaned_df.drop(['q4', 'q8', 'q9'], axis=1)

In [13]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28108 entries, 0 to 28107
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   timestamp  28108 non-null  object 
 1   q1         28108 non-null  object 
 2   q2         28033 non-null  object 
 3   q3         28107 non-null  object 
 4   q5         28108 non-null  object 
 5   q6         20793 non-null  float64
 6   q7         28108 non-null  object 
 7   q10        28108 non-null  object 
 8   q11        23074 non-null  object 
 9   q12        28026 non-null  object 
 10  q13        28108 non-null  object 
 11  q14        28108 non-null  object 
 12  q15        27885 non-null  object 
 13  q16        27937 non-null  object 
 14  q17        27931 non-null  object 
dtypes: float64(1), object(14)
memory usage: 3.2+ MB


In [14]:
# q6 is already float64 (see df.info()), so it can be imputed directly.
median_q6 = cleaned_df['q6'].median()
# Assign back rather than using inplace=True on a column selection,
# which does not reliably update the DataFrame under pandas Copy-on-Write.
cleaned_df['q6'] = cleaned_df['q6'].fillna(median_q6)
print(f"Missing values in q6 after imputation: {cleaned_df['q6'].isnull().sum()}")

Missing values in q6 after imputation: 0


In [15]:
cleaned_df.head(10)

Unnamed: 0,timestamp,q1,q2,q3,q5,q6,q7,q10,q11,q12,q13,q14,q15,q16,q17
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,55000,0.0,USD,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,54600,4000.0,GBP,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,34000,2000.0,USD,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,62000,3000.0,USD,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,60000,7000.0,USD,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White
5,4/27/2021 11:02:46,25-34,Education (Higher Education),Scholarly Publishing Librarian,62000,2000.0,USD,USA,New Hampshire,Hanover,8 - 10 years,2 - 4 years,Master's degree,Man,White
6,4/27/2021 11:02:51,25-34,Publishing,Publishing Assistant,33000,2000.0,USD,USA,South Carolina,Columbia,2 - 4 years,2 - 4 years,College degree,Woman,White
7,4/27/2021 11:03:00,25-34,Education (Primary/Secondary),Librarian,50000,2000.0,USD,United States,Arizona,Yuma,5-7 years,5-7 years,Master's degree,Man,White
8,4/27/2021 11:03:01,45-54,Computing or Tech,Systems Analyst,112000,10000.0,USD,US,Missouri,St. Louis,21 - 30 years,21 - 30 years,College degree,Woman,White
9,4/27/2021 11:03:02,35-44,"Accounting, Banking & Finance",Senior Accountant,45000,0.0,USD,United States,Florida,Palm Coast,21 - 30 years,21 - 30 years,College degree,Woman,"Hispanic, Latino, or Spanish origin, White"


In [16]:
cleaned_df['q5'] = cleaned_df['q5'].astype(str).str.replace(r'[^\d.]', '', regex=True)
cleaned_df['q5'] = pd.to_numeric(cleaned_df['q5'], errors='coerce')
print(cleaned_df[['q5']].info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28108 entries, 0 to 28107
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   q5      28108 non-null  int64
dtypes: int64(1)
memory usage: 219.7 KB
None


In [17]:
cleaned_df['q5'].head(10)

0     55000
1     54600
2     34000
3     62000
4     60000
5     62000
6     33000
7     50000
8    112000
9     45000
Name: q5, dtype: int64
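
Stripping every non-digit character succeeds silently, even on entries that were never clean numbers. A minimal sketch of one way to flag such entries before trusting the parsed values (the sample strings in `raw` are made up and are not taken from the survey):

```python
import pandas as pd

# Made-up salary strings; only the first two are plain numbers.
raw = pd.Series(["55,000", "54600", "$62,000", "60k", "45.000"])

# Same cleaning rule as in the cell above: keep digits and dots only.
parsed = pd.to_numeric(raw.str.replace(r"[^\d.]", "", regex=True), errors="coerce")

# Flag entries that are not plain digits with optional thousands commas.
suspicious = ~raw.str.fullmatch(r"\d{1,3}(,\d{3})*|\d+")
print(pd.DataFrame({"raw": raw, "parsed": parsed, "suspicious": suspicious}))
```

In this toy example "60k" parses to 60 and "45.000" to 45, which is why flagged rows are worth reviewing by hand.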

In [18]:
cleaned_df['timestamp'] = pd.to_datetime(cleaned_df['timestamp'], errors='coerce')
print(cleaned_df[['timestamp']].info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28108 entries, 0 to 28107
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   timestamp  28108 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 219.7 KB
None


In [19]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28108 entries, 0 to 28107
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   timestamp  28108 non-null  datetime64[ns]
 1   q1         28108 non-null  object        
 2   q2         28033 non-null  object        
 3   q3         28107 non-null  object        
 4   q5         28108 non-null  int64         
 5   q6         28108 non-null  float64       
 6   q7         28108 non-null  object        
 7   q10        28108 non-null  object        
 8   q11        23074 non-null  object        
 9   q12        28026 non-null  object        
 10  q13        28108 non-null  object        
 11  q14        28108 non-null  object        
 12  q15        27885 non-null  object        
 13  q16        27937 non-null  object        
 14  q17        27931 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(12)
memory usage: 3.2+ MB


In [20]:
cleaned_df.q10.unique()

array(['United States', 'United Kingdom', 'US', 'USA', 'Canada',
       'United Kingdom ', 'usa', 'UK', 'Scotland ', 'U.S.',
       'United States ', 'The Netherlands', 'Australia ', 'Spain', 'us',
       'Usa', 'England', 'finland', 'United States of America', 'France',
       'United states', 'Scotland', 'USA ', 'United states ', 'Germany',
       'UK ', 'united states', 'Ireland', 'India', 'Australia', 'Uk',
       'United States of America ', 'U.S. ', 'canada', 'Canada ', 'U.S>',
       'ISA', 'Argentina', 'Great Britain ', 'US ', 'United State',
       'U.S.A', 'Denmark', 'U.S.A.', 'America', 'Netherlands',
       'netherlands', 'England ', 'united states of america', 'Ireland ',
       'Switzerland', 'Netherlands ', 'Bermuda', 'Us',
       'The United States', 'United State of America', 'Germany ',
       'Malaysia', 'Mexico ', 'United Stated', 'South Africa ', 'Belgium',
       'Northern Ireland', 'u.s.', 'South Africa', 'UNITED STATES',
       'united States', 'Sweden', 'Hong K

In [22]:
# Assign the result back; replace(..., inplace=True) on a column selection
# does not reliably update the DataFrame under pandas Copy-on-Write.
cleaned_df['q10'] = cleaned_df['q10'].replace(
    [
        'United States', 'United States of America', 'USA', 'U.S.', 'U.S. ', 'U.S>', 'usa', 
        'Usa', 'ISA', 'united states of america', 'Us', 'us', 'The United States', 
        'UNITED STATES', 'united States,', 'USA-- Virgin Islands', 'United Statws', 
        'Unites States ', 'U.S.A.', 'U.S.A. ', 'U. S. ', 'United States of American ', 
        'United States of America'
    ], 'United States of America'
)

In [23]:
cleaned_df.q10.unique()

array(['United States of America', 'United Kingdom', 'US', 'Canada',
       'United Kingdom ', 'UK', 'Scotland ', 'United States ',
       'The Netherlands', 'Australia ', 'Spain', 'England', 'finland',
       'France', 'United states', 'Scotland', 'USA ', 'United states ',
       'Germany', 'UK ', 'united states', 'Ireland', 'India', 'Australia',
       'Uk', 'United States of America ', 'canada', 'Canada ',
       'Argentina', 'Great Britain ', 'US ', 'United State', 'U.S.A',
       'Denmark', 'America', 'Netherlands', 'netherlands', 'England ',
       'Ireland ', 'Switzerland', 'Netherlands ', 'Bermuda',
       'United State of America', 'Germany ', 'Malaysia', 'Mexico ',
       'United Stated', 'South Africa ', 'Belgium', 'Northern Ireland',
       'u.s.', 'South Africa', 'united States', 'Sweden', 'Hong Kong',
       'Kuwait', 'Norway', 'Sri lanka', 'Contracts', 'England/UK', 'U.S',
       "We don't get raises, we get quarterly bonuses, but they periodically asses income in the ar

In [24]:
cleaned_df['q10'] = cleaned_df['q10'].astype(str).str.strip().str.lower()

In [25]:
cleaned_df.q10.unique()

array(['united states of america', 'united kingdom', 'us', 'canada', 'uk',
       'scotland', 'united states', 'the netherlands', 'australia',
       'spain', 'england', 'finland', 'france', 'usa', 'germany',
       'ireland', 'india', 'argentina', 'great britain', 'united state',
       'u.s.a', 'denmark', 'america', 'netherlands', 'switzerland',
       'bermuda', 'united state of america', 'malaysia', 'mexico',
       'united stated', 'south africa', 'belgium', 'northern ireland',
       'u.s.', 'sweden', 'hong kong', 'kuwait', 'norway', 'sri lanka',
       'contracts', 'england/uk', 'u.s',
       "we don't get raises, we get quarterly bonuses, but they periodically asses income in the area you work, so i got a raise because a 3rd party assessment showed i was paid too little for the area we were located",
       'england, uk.', 'greece', 'japan', 'britain', 'united sates',
       'austria', 'brazil', 'canada, ottawa, ontario', 'global',
       'uniited states', 'united kingdom (engl

In [26]:
usa_variations = [
    "usa", "united states", "united states of america", "us", "u.s.", "u.s.a",
    "america", "united state", "united sates", "united statws", "united statesp",
    "united stattes", "united statea", "united statees", "unites states"
]

cleaned_df['q10'] = cleaned_df['q10'].apply(lambda x: "United States" if x in usa_variations else x)

print(cleaned_df['q10'].unique())

['United States' 'united kingdom' 'canada' 'uk' 'scotland'
 'the netherlands' 'australia' 'spain' 'england' 'finland' 'france'
 'germany' 'ireland' 'india' 'argentina' 'great britain' 'denmark'
 'netherlands' 'switzerland' 'bermuda' 'united state of america'
 'malaysia' 'mexico' 'united stated' 'south africa' 'belgium'
 'northern ireland' 'sweden' 'hong kong' 'kuwait' 'norway' 'sri lanka'
 'contracts' 'england/uk' 'u.s'
 "we don't get raises, we get quarterly bonuses, but they periodically asses income in the area you work, so i got a raise because a 3rd party assessment showed i was paid too little for the area we were located"
 'england, uk.' 'greece' 'japan' 'britain' 'austria' 'brazil'
 'canada, ottawa, ontario' 'global' 'uniited states'
 'united kingdom (england)'
 'worldwide (based in us but short term trips aroudn the world)' 'canadw'
 'hungary' 'luxembourg' 'united sates of america'
 'united states (i work from home and my clients are all over the us/canada/pr'
 'colombia' 'unt
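
Several US variants (for example 'united state of america', 'united stated', and 'u.s') survive the exact-match list above. A minimal sketch of a more forgiving approach: strip punctuation and extra whitespace before looking the value up in an alias set. The set `US_ALIASES` and the helper `normalize_country` are hypothetical; a real alias list would be built by reviewing the unique values of q10.

```python
import re
import pandas as pd

# Hypothetical alias set, written without punctuation and in lowercase.
US_ALIASES = {
    "us", "usa", "united states", "united states of america",
    "the united states", "america", "united state", "united stated",
    "united state of america", "uniited states", "united sates",
}

def normalize_country(raw):
    """Map known US spellings to one label; otherwise return the trimmed, lowercased input."""
    cleaned = str(raw).strip().lower()
    # Build a lookup key: letters and spaces only, single-spaced.
    key = re.sub(r"[^a-z ]", "", cleaned)
    key = re.sub(r"\s+", " ", key).strip()
    return "United States" if key in US_ALIASES else cleaned

sample = pd.Series(["U.S.", "USA ", "united States", "U.S>", "United Stated", "canada", "UK "])
normalized = sample.map(normalize_country)
print(normalized.tolist())
```

Entries that are not country names at all (such as the long free-text answer in the output above) pass through unchanged and would still need a separate decision.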

#### Reflection
Write a short reflection (400-500 words) answering the following: 
- What were the biggest issues you encountered in the messy dataset?
- How did cleaning the dataset improve its usability for machine learning?
- What would happen if we trained a model on the messy dataset vs. the cleaned one?
- Do you feel you skewed or biased the dataset while cleaning it?

### ANSWER

#### The biggest issue I encountered in the messy dataset was standardizing the country names: I tried to do it but did not fully succeed. I tried replacing the different variations of the name of the USA, but the desired outcome wasn't achieved, since several variants remain in q10. I had the most trouble working on the q10 column. Another thing that was slightly hard was cleaning the q5 column and turning it into a numerical feature. First I had to remove the commas and monetary symbols present there, then I had to change the data type from object to integer to make it a numerical value. Other than that, the pre-processing steps were less complex.

#### Cleaning can drastically improve the quality of a dataset, making it accurate, consistent, and reliable enough to train a model. If a raw dataset is used to train a model, it can give inconsistent, skewed, and false predictions or results. A raw dataset usually contains missing values, duplicates, outliers, and other inconsistencies; without removing such issues, training a model can be very difficult and the model's performance can suffer. A clean dataset is key to a well-trained model.

#### If a model is trained on a messy dataset, the outcome is simply inaccurate. The "garbage in, garbage out" principle describes this scenario well. The messy dataset might contain a lot of noise, which can lead the trained model to make skewed predictions and learn patterns that have no real value. The outliers present in the dataset can skew its numerical and statistical values, resulting in poor generalization when the model is given unseen data. Training might also use many more computational resources. On the other hand, a clean dataset makes the model efficient, consistent, and accurate. It can make better predictions and more calculated decisions. A clean dataset reduces the chances of overfitting. A model trained on a clean dataset also trains faster, because it doesn't have to work with inconsistent or missing values. This makes the model reliable for real-world application.

#### I do not think that I have skewed the dataset or created any bias while cleaning it. The goal of cleaning the dataset was to address skewness and other inconsistencies. If a new bias were created while removing another, the point of cleaning the dataset would be moot. I think I have turned the dataset into a cleaner one, which is considerably more reliable than it would be if used uncleaned.
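
Whether cleaning shifted the data can also be checked empirically by comparing summary statistics before and after a step. A minimal sketch for median imputation on made-up values (the series `before` is an assumption and is not the survey's q6):

```python
import numpy as np
import pandas as pd

# Made-up compensation values with three missing entries.
before = pd.Series([0.0, 2000.0, np.nan, 10000.0, np.nan, 4000.0, np.nan, 0.0])

# Same strategy as the notebook: fill gaps with the column median.
after = before.fillna(before.median())

# Side-by-side summary of the column before and after imputation.
summary = pd.DataFrame({"before": before.describe(), "after": after.describe()})
print(summary.loc[["count", "mean", "std", "50%"]])
```

In this toy example the median is unchanged, but the standard deviation shrinks because every filled row sits exactly at the median. That is one way an imputation choice can change the shape of a column.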

## Deliverables
Upload your Jupyter Notebook to your GitHub repo and then provide a link to that repo in Worldclass.