## Lecture Housekeeping:

- Disrespectful language is prohibited in questions. This is a supportive learning environment for all, so please engage accordingly.
    - Please review Code of Conduct (in Student Undertaking Agreement) if unsure
- No question is daft or silly - ask them!
- There are Q&A sessions midway and at the end of the session, should you wish to ask any follow-up questions.
- Should you have any questions after the lecture, please schedule a mentor session.
- For all non-academic questions, please submit a query: [www.hyperiondev.com/support](www.hyperiondev.com/support)

## Data Analysis

#### Learning objectives

   - Explore ways we can tailor our datasets to be better fitted for our goals
   - Discuss data cleaning with examples of common errors and inconsistencies



Data scientists spend a large amount of their time cleaning datasets and getting them down to a form with which they can work. In fact, a lot of data scientists argue that the initial steps of obtaining and cleaning data constitute 80% of the job. It is important to be able to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers.


In [4]:
import pandas as pd
import numpy as np

# the file is whitespace-delimited; sep=r'\s+' replaces the deprecated delim_whitespace=True
df = pd.read_csv('balance.txt', sep=r'\s+')
df.head()

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
0,12.240798,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian
1,23.283334,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian
2,22.530409,104.593,7075,514,4,71,11,Male,No,No,Asian
3,27.652811,148.924,9504,681,3,36,11,Female,No,No,Asian
4,16.893978,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian


### Dropping Columns in a DataFrame

Often, you’ll find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset containing student information (name, grade, standard, parents’ names, and address) but want to focus on analyzing student grades.
In this case, the address or parents’ names categories are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially also bog down runtime.

Pandas provides a handy way of removing unwanted columns or rows from a DataFrame: the `drop()` method. Let’s look at a simple example where we drop a number of columns from a DataFrame.


In [6]:
df.drop(['Limit','Age'], inplace=True, axis=1)

Above, we pass `drop()` a list containing the names of the columns we want to remove, together with `inplace=True` and `axis=1`. Setting `axis=1` tells pandas to look for those labels among the columns rather than the row index, and `inplace=True` tells it to modify our DataFrame directly instead of returning a new one.

When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed.

In [None]:
df.head()

Unnamed: 0,Balance,Income,Rating,Cards,Education,Gender,Student,Married,Ethnicity
0,12.240798,14.891,283,2,11,Male,No,Yes,Caucasian
1,23.283334,106.025,483,3,15,Female,Yes,Yes,Asian
2,22.530409,104.593,514,4,11,Male,No,No,Asian
3,27.652811,148.924,681,3,11,Female,No,No,Asian
4,16.893978,55.882,357,2,16,Male,No,Yes,Caucasian
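
As a small aside on the `inplace` and `axis` parameters described above, here is a minimal sketch of the non-mutating form of `drop()`. The `students` DataFrame and its values are hypothetical toy data modelled on the student example, not part of the balance dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the student example above
students = pd.DataFrame({
    "name": ["Ann", "Ben"],
    "grade": [78, 91],
    "address": ["1 Main Rd", "2 Oak Ave"],
    "parent_name": ["Sue", "Tom"],
})

# columns=[...] is equivalent to passing the labels with axis=1.
# Without inplace=True, drop() returns a new DataFrame and leaves the original alone.
grades_only = students.drop(columns=["address", "parent_name"])

print(list(grades_only.columns))
print(list(students.columns))
```

Keeping the original intact like this can be convenient while exploring, since a dropped column can still be recovered from `students`.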


### Replace values

Sometimes you would like to replace a value in your dataset with another value. For example, in a column such as ‘Ethnicity’, we might want to rename one category, say 'African American', to 'African'.

In [None]:
df.replace('African American','African').head(10)

Unnamed: 0,Balance,Income,Rating,Cards,Education,Gender,Student,Married,Ethnicity
0,12.240798,14.891,283,2,11,Male,No,Yes,Caucasian
1,23.283334,106.025,483,3,15,Female,Yes,Yes,Asian
2,22.530409,104.593,514,4,11,Male,No,No,Asian
3,27.652811,148.924,681,3,11,Female,No,No,Asian
4,16.893978,55.882,357,2,16,Male,No,Yes,Caucasian
5,22.486178,80.18,569,4,10,Male,No,No,Caucasian
6,10.574516,20.996,259,2,12,Female,No,No,African
7,14.576204,71.408,512,2,9,Male,No,No,Asian
8,7.93809,15.125,266,5,13,Female,No,No,Caucasian
9,17.756965,71.061,491,3,19,Female,Yes,Yes,African
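
One detail worth illustrating: `replace()` returns a new object by default, so the call above displays the renamed values without changing `df`. The sketch below shows this on a hypothetical toy DataFrame (`people` is an assumed name, not from the lecture data).

```python
import pandas as pd

# Hypothetical toy data, not the balance dataset used above
people = pd.DataFrame({"Ethnicity": ["Caucasian", "African American", "Asian"]})

# replace() returns a new DataFrame; `people` itself is left untouched
renamed = people.replace("African American", "African")

# To keep the change, assign the result back (here restricted to one column)
people_updated = people.copy()
people_updated["Ethnicity"] = people_updated["Ethnicity"].replace(
    {"African American": "African"}
)

print(people["Ethnicity"].tolist())
print(people_updated["Ethnicity"].tolist())
```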


### Grouping Data

Grouping is frequently used in data analysis, for example when we need a result broken down by the various groups present in the dataset. Pandas has built-in methods that can roll the data up into groups.

In the example below, we group the data by Ethnicity and then retrieve the rows for one specific group.

In [None]:
grouped = df.groupby('Ethnicity')
grouped.get_group('Asian').head()

Unnamed: 0,Balance,Income,Rating,Cards,Education,Gender,Student,Married,Ethnicity
1,23.283334,106.025,483,3,15,Female,Yes,Yes,Asian
2,22.530409,104.593,514,4,11,Male,No,No,Asian
3,27.652811,148.924,681,3,11,Female,No,No,Asian
7,14.576204,71.408,512,2,9,Male,No,No,Asian
12,19.2188,80.616,394,1,7,Female,No,Yes,Asian
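
Beyond pulling out a single group, grouping is often followed by an aggregation. The following is a minimal sketch on hypothetical toy data (the `toy` DataFrame and the output column names `mean_income` and `n_rows` are assumptions chosen for illustration).

```python
import pandas as pd

# Hypothetical toy data with the same column names as the balance dataset
toy = pd.DataFrame({
    "Ethnicity": ["Asian", "Asian", "Caucasian", "Caucasian"],
    "Income": [100.0, 50.0, 20.0, 40.0],
})

# One summary row per group: the mean income and the number of rows in each group
summary = toy.groupby("Ethnicity").agg(
    mean_income=("Income", "mean"),
    n_rows=("Income", "count"),
)

print(summary)
```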


### Dealing with inconsistent data entry

To begin with, let us install a module that will help us clean our data set. Go to your command prompt or terminal and type `pip install fuzzywuzzy` or `pip3 install fuzzywuzzy`. You will also need to install `python-Levenshtein` and `chardet`.

In [5]:
#%pip install fuzzywuzzy
#%pip install chardet
# helpful libraries
import fuzzywuzzy
from fuzzywuzzy import process
import chardet

# set seed for reproducibility
np.random.seed(0)

In [10]:
income_df = pd.read_csv("store_income_data_example.csv")
income_df.head()

Unnamed: 0,id,store_name,store_email,department,income,date_measured,country
0,1,"Cullen/Frost Bankers, Inc.",,Clothing,$54438554.24,14 July 2006,UK
1,2,Nordson Corporation,,Tools,$41744177.01,3 December 2006,united states of america
2,3,"Stag Industrial, Inc.",,Beauty,$36152340.34,12 August 2003,UNITED STATES
3,4,FIRST REPUBLIC BANK,ecanadine3@fc2.com,Automotive,$8928350.04,26 October 2006,UK
4,5,Mercantile Bank Corporation,,Baby,$33552742.32,24 December 1973,UK


#### Text pre-processing

For this exercise, we are interested in cleaning up the "country" column to make sure there are no data entry inconsistencies in it. We could go through and check each row by hand and manually correct inconsistencies when we find them. But there's a more efficient way to do this!

In [11]:
countries = income_df['country'].unique()
print(f"There are {len(countries)} unique countries")
countries

There are 34 unique countries


array(['UK ', 'united states of america', 'UNITED STATES', 'uk',
       ' United States of America', 'South Africa ', 'United States.',
       'United States', 'South Africa/', 'United States ',
       'United States of America', 'South Africa.', 'United Kingdom ',
       'United States of America ', 'United States of America/',
       'south africa', 'UK/', 'United Kingdom.', ' United Kingdom',
       ' South Africa', 'United Kingdom/', 'SOUTH AFRICA', ' UK',
       'united kingdom', 'UNITED KINGDOM', ' United States',
       'UNITED STATES OF AMERICA', 'South Africa', 'United States/',
       'united states', 'United States of America.', 'UK',
       'United Kingdom', 'UK.'], dtype=object)

Just looking at this, we can see some problems due to inconsistent data entry. Take the entries 'United States of America/' and 'United States of America.', for example. These refer to the same country, but the computer treats them as different values. The data capturer must have mistyped this country a few times.

The first thing we need to do is make everything lower case (we can change it back at the end if we'd like) and remove any white space at the beginning and end of cells. Inconsistencies in capitalization and stray leading or trailing white space are very common in text data, and you can fix a good 80% of your text data entry inconsistencies by doing this.

In [12]:
# convert to lower case
income_df['country'] = income_df['country'].str.lower()

# remove leading and trailing white spaces
income_df['country'] = income_df['country'].str.strip()

# Let us view the data
countries = income_df['country'].unique()
print(f"There are {len(countries)} unique countries")
countries

There are 15 unique countries


array(['uk', 'united states of america', 'united states', 'south africa',
       'united states.', 'south africa/', 'south africa.',
       'united kingdom', 'united states of america/', 'uk/',
       'united kingdom.', 'united kingdom/', 'united states/',
       'united states of america.', 'uk.'], dtype=object)
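
To see the effect of these two steps in isolation, here is a minimal sketch on a hypothetical toy Series (`raw` and its values are made up for illustration). It counts the distinct values before and after lower-casing and stripping.

```python
import pandas as pd

# Hypothetical messy entries, similar in spirit to the country column
raw = pd.Series([" UK", "uk ", "UK", "United Kingdom ", "united kingdom"])

# Chain the two string operations: lower-case, then strip surrounding white space
cleaned = raw.str.lower().str.strip()

print(raw.nunique(), "distinct values before")
print(cleaned.nunique(), "distinct values after")
```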

Alright, let's take another look at the country column and see if there's any more data cleaning we need to do.

It does look like there are some remaining inconsistencies: 'united states of america/' and 'united states of america' should probably be the same.

We are going to use the `fuzzywuzzy` package to help identify which strings are closest to each other. This dataset is small enough that we could probably correct errors by hand, but that approach doesn't scale well. (Would you want to correct a thousand errors by hand? What about ten thousand?) Automating things as early as possible is generally a good idea. Plus, it’s fun! :)

Fuzzy matching: The process of automatically finding text strings that are very similar to the target string. In general, a string is considered "closer" to another one the fewer characters you'd need to change if you were transforming one string into the other. So "apple" and "snapple" are two changes away from each other (add "s" and "n") while "in" and "on" are one change away (replace "i" with "o"). You won't always be able to rely on fuzzy matching, but it will usually end up saving you at least a little time.
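
The "number of changes" idea can be made concrete with a small edit-distance function. This is a sketch of the classic Levenshtein distance written from scratch for illustration; `edit_distance` is a hypothetical helper and is not how `fuzzywuzzy` is implemented internally.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 chars of a and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if the characters are equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("apple", "snapple"))
print(edit_distance("in", "on"))
```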

Fuzzywuzzy returns a ratio given two strings. The closer the ratio is to 100, the smaller the edit distance between the two strings. Here, we're going to get the ten strings from our list of countries that have the closest distance to "uk".



In [13]:
# get the top 10 closest matches to "uk"
matches = fuzzywuzzy.process.extract("uk", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

[('uk', 100),
 ('uk/', 100),
 ('uk.', 100),
 ('south africa', 14),
 ('south africa/', 14),
 ('south africa.', 14),
 ('united states', 13),
 ('united states.', 13),
 ('united states/', 13),
 ('united kingdom', 12)]
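
To get a feel for how a 0 to 100 similarity score ranks candidates without installing anything, the sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in. It is not fuzzywuzzy's `token_sort_ratio`, so its scores will differ from the output above (for instance, it does not ignore punctuation); `similarity` and `candidates` are hypothetical names.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Similarity between two strings, scaled to 0-100 (a stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


candidates = ["south africa", "united kingdom", "uk/", "uk"]

# Rank the candidates from most to least similar to "uk"
ranked = sorted(
    ((candidate, similarity("uk", candidate)) for candidate in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)

for candidate, score in ranked:
    print(candidate, score)
```

Even with this simpler scorer, "united kingdom" lands far below "uk/", which is why the abbreviation and the full name have to be reconciled separately later on.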

We can see that two of the items in countries are very close to "uk": "uk/" and "uk.". Let's replace all rows in our country column that have a ratio of at least 90 with "uk".

To do this, we are going to write a function. (It's a good idea to write a general purpose function you can reuse if you think you might have to do a specific task more than once or twice. This keeps you from having to copy and paste code too often, which saves time and can help prevent mistakes.)

In [None]:
# Replace values in `column` of `df` that fuzzy-match `string_to_match`
# with a ratio of at least `min_ratio` by `string_to_match` itself.
def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # keep only the matched strings whose ratio is >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # find the rows whose value is one of the close matches
    rows_with_matches = df[column].isin(close_matches)

    # overwrite those rows with the input string
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")

Now that we have a function, we can put it to the test!



In [None]:
replace_matches_in_column(df=income_df, column='country', string_to_match="united kingdom")
replace_matches_in_column(df=income_df, column='country', string_to_match="united states")
replace_matches_in_column(df=income_df, column='country', string_to_match="united states of america")
replace_matches_in_column(df=income_df, column='country', string_to_match="south africa")
replace_matches_in_column(df=income_df, column='country', string_to_match="uk")

All done!
All done!
All done!
All done!
All done!
   id                   store_name         store_email  department  \
0   1   Cullen/Frost Bankers, Inc.                 NaN    Clothing   
1   2          Nordson Corporation                 NaN       Tools   
2   3        Stag Industrial, Inc.                 NaN      Beauty   
3   4          FIRST REPUBLIC BANK  ecanadine3@fc2.com  Automotive   
4   5  Mercantile Bank Corporation                 NaN        Baby   

         income     date_measured                   country  
0  $54438554.24      14 July 2006                        uk  
1  $41744177.01   3 December 2006  united states of america  
2  $36152340.34    12 August 2003             united states  
3   $8928350.04   26 October 2006                        uk  
4  $33552742.32  24 December 1973                        uk  


And now let's check the unique values in our country column again and make sure we've tidied up the near-duplicates correctly.



In [None]:
# get all the unique values in the 'country' column
countries = income_df['country'].unique()

print(f"There are {len(countries)} unique countries")
countries


There are 5 unique countries


array(['uk', 'united states of america', 'united states', 'south africa',
       'united kingdom'], dtype=object)

Now there is one thing left to do: 'uk' and 'united kingdom' refer to the same country, as do 'united states' and 'united states of america'. We could use fuzzy matching to fix inconsistencies like this, but it can be risky: for example, 'united states' might match 'united kingdom' better than 'uk' does. It is important to exercise caution in such cases. Here, we can simply replace the values directly.

In [None]:
income_df.replace('uk', 'united kingdom', inplace=True)
income_df.replace('united states of america', 'united states', inplace=True)

# get all the unique values in the 'country' column
countries = income_df['country'].unique()

print(f"There are {len(countries)} unique countries")
countries

There are 3 unique countries


array(['united kingdom', 'united states', 'south africa'], dtype=object)
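
Calling `replace()` on the whole DataFrame changes matching values in every column. A more cautious variant, sketched below on hypothetical toy data, restricts the mapping to the country column and then checks that only expected values remain. The names `stores`, `canonical`, and `expected` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: note the store that happens to be named "uk"
stores = pd.DataFrame({
    "store_name": ["uk", "Acme"],
    "country": ["uk", "united states of america"],
})

# Map each known variant to a single canonical spelling, in the country column only
canonical = {
    "uk": "united kingdom",
    "united states of america": "united states",
}
stores["country"] = stores["country"].replace(canonical)

# Sanity check: anything outside the expected set still needs attention
expected = {"united kingdom", "united states", "south africa"}
unexpected = set(stores["country"].unique()) - expected

print(stores)
print("Unexpected values:", unexpected)
```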

### Working with Dates and Times

Analyzing datasets with dates and times is often very cumbersome. Months of different lengths, different distributions of weekdays and weekends, leap years, and the dreaded timezones are just a few things you may have to consider depending on your context. For this reason, Python has a data type specifically designed for dates and times called datetime.

In [15]:
# modules we'll use
from datetime import date
# print the first few rows of the date column
income_df.date_measured.head()


0        14 July 2006
1     3 December 2006
2      12 August 2003
3     26 October 2006
4    24 December 1973
Name: date_measured, dtype: object

Yep, those are dates! But just because we can recognize them as dates doesn't mean that the computer does. Notice that at the bottom of the output of `head()`, it says that the data type of this column is "object".

Pandas uses the "object" dtype for storing various kinds of data, but most often when you see a column with the dtype "object" it will have strings in it.

If you check the [pandas dtype documentation](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes), you'll notice that there's also a specific datetime64 dtype. Because the dtype of our column is object rather than datetime64, we can tell that pandas doesn't know that this column contains dates.

We can also look at just the dtype of our column without printing the first few rows if we like:

In [None]:
# check the data type of our date column
income_df['date_measured'].dtype

dtype('O')

You may have to check the [numpy documentation](https://numpy.org/doc/stable/reference/arrays.interface.html#arrays-interface) to match the letter code to the dtype of the object. "O" is the code for "object", so we can see that these two methods give us the same information.
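
If looking up letter codes feels fiddly, pandas also ships helper functions that answer the question directly. The following is a minimal sketch using `pandas.api.types.is_datetime64_any_dtype` on two hypothetical toy Series (`as_text` and `as_dates` are made-up names).

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical toy data: the same date stored as text and as a real timestamp
as_text = pd.Series(["14 July 2006"])
as_dates = pd.Series([pd.Timestamp("2006-07-14")])

# The helper returns True only for datetime64 columns
print(is_datetime64_any_dtype(as_text))
print(is_datetime64_any_dtype(as_dates))

# The single-letter code is also available as dtype.kind ("M" means datetime)
print(as_dates.dtype.kind)
```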

### Convert our date columns to datetime

Now that we know that our date column isn't being recognized as a date, it's time to convert it so that it is recognized as a date. This is called "parsing dates" because we're taking in a string and identifying its component parts.

We can tell pandas what the format of our dates is with a guide called "strftime directives". You can find more information about these directives in the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.Period.strftime.html). The basic idea is that you need to point out which parts of the date are where and what punctuation is between them. There are lots of possible parts of a date; some of the most common are:


| Code | Meaning | Example |
|------|---------|---------|
| %A | Weekday as locale’s full name | Wednesday |
| %a | Weekday as locale’s abbreviated name | Wed |
| %B | Month as locale’s full name | June |
| %d | Day of the month as a zero-padded number | 06 |
| %m | Month as a zero-padded number | 06 |
| %Y | Four-digit year | 2018 |
| %y | Two-digit year | 18 |


Some examples:

1/17/07 has the format "%m/%d/%y"

17-1-2007 has the format "%d-%m-%Y"
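
These format strings can be tried out with the standard library's `datetime.strptime`, which uses the same directives. A minimal sketch (the third string mirrors the style of our income data; month names such as "July" assume an English locale):

```python
from datetime import datetime

# The two examples above describe the same day in different layouts
first = datetime.strptime("1/17/07", "%m/%d/%y")
second = datetime.strptime("17-1-2007", "%d-%m-%Y")

# Day, full month name, four-digit year, separated by spaces
third = datetime.strptime("14 July 2006", "%d %B %Y")

print(first.date(), second.date(), third.date())
```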


Looking back up at the head of the `date_measured` column in our income dataset, we can see that it's in the format "day, full month name, four-digit year", separated by spaces (for example, 14 July 2006), so we can combine the directives `%d`, `%B`, and `%Y` to parse our dates:

In [20]:
# create a new column, date_parsed, with the parsed dates


income_df['date_parsed'] = pd.to_datetime(income_df['date_measured'], format='%d %B %Y')


income_df['date_parsed'].head()


0   2006-07-14
1   2006-12-03
2   2003-08-12
3   2006-10-26
4   1973-12-24
Name: date_parsed, dtype: datetime64[ns]

Now when you check the first few rows of the new column, you can see that the dtype is datetime64. You can also see that the dates have been slightly rearranged so that they fit the default order of datetime objects (year-month-day).

Now that our dates are parsed correctly, we can interact with them in useful ways.
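
As one illustration of what becomes possible, parsed columns expose the `.dt` accessor. The sketch below rebuilds a small hypothetical Series in the same format as `date_measured` so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy Series in the same "day month year" format as date_measured
dates = pd.to_datetime(
    pd.Series(["14 July 2006", "3 December 2006", "12 August 2003"]),
    format="%d %B %Y",
)

# Pull out individual components of each date
years = dates.dt.year
months = dates.dt.month_name()

# Datetime columns can also be compared and subtracted
newest = dates.max()
span = dates.max() - dates.min()

print(years.tolist())
print(months.tolist())
print(newest, span)
```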

What if I run into an error with multiple date formats? While we're specifying the date format here, sometimes you'll run into an error when there are multiple date formats in a single column. If that happens, you can have pandas try to infer what the right date format should be. You can do that like so:

`income_df['date_parsed'] = pd.to_datetime(income_df['date_measured'], infer_datetime_format=True)`

Note that from pandas 2.0 onwards the `infer_datetime_format` argument is deprecated, because pandas already infers the format by default when `format` is omitted.

Why don't you always use `infer_datetime_format=True`?
There are two big reasons not to always have pandas guess the date format.
The first is that pandas won't always be able to figure out the correct date format, especially if someone has gotten creative with data entry.
The second is that it's much slower than specifying the exact format of the dates.
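
One way to keep an explicit format while still coping with creative data entry is to pass `errors="coerce"`, which turns anything that does not match into `NaT` so it can be reviewed. This is a minimal sketch on hypothetical toy data (`mixed` and `needs_review` are made-up names), not a step from the lecture notebook.

```python
import pandas as pd

# Hypothetical column where only the first entry follows the expected format
mixed = pd.Series(["14 July 2006", "2006/07/14", "not a date"])

# Entries that do not match the format become NaT instead of raising an error
parsed = pd.to_datetime(mixed, format="%d %B %Y", errors="coerce")

# Collect the original strings that failed to parse for manual inspection
needs_review = mixed[parsed.isna()]

print(parsed)
print(needs_review.tolist())
```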

**Bibliography**
1. Agarwal, M. (n.d.). Pythonic Data Cleaning With NumPy and Pandas. Retrieved April 23, 2019, from Real Python: https://realpython.com/python-data-cleaning-numpy-pandas/
2. Sethi, N. (2018). Data Cleaning: Parsing Dates. Retrieved from Data Driven Investor: https://medium.com/datadriveninvestor/data-cleaning-parsing-dates-34792fc4d6c8