# How to Clean Data with Python

In [8]:
import pandas as pd
import numpy as np
import glob
import re

## 2. Diagnose the Data

**Task 1**  
- We have provided two DataFrames, `df1` and `df2`.
- Inspect both of these DataFrames using the functions listed above.
- Start by printing the `.head()` of both `df1` and `df2`.

<br>

**Task 2**  
- Explore the DataFrames using the other functions listed.
- Which DataFrame is “clean”, and ready for analysis? 
- Create a variable named `clean` and set it equal to either `1` for `df1` or `2` for `df2`, depending on which DataFrame is clean.

In [3]:
df1 = pd.read_csv("df1.csv")
df2 = pd.read_csv("df2.csv")

In [4]:
# Task 1
df1.head()

Unnamed: 0,Grocery Item,Cake Recipe,Pancake Recipe,Cookie Recipe
0,Eggs,2,3,1
1,Milk,1,2,1
2,Flour,2,1,2


In [5]:
# Task 1
df2.head()

Unnamed: 0,Grocery Item,Recipe,Number
0,Eggs,Cake Recipe,2
1,Milk,Cake Recipe,1
2,Flour,Cake Recipe,2
3,Eggs,Pancake Recipe,3
4,Milk,Pancake Recipe,2


In [6]:
# Task 2
clean = 2  # df2 stores one observation per row, with one column per variable
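
As a rough illustration of the diagnosis step, the sketch below rebuilds two small frames shaped like the `df1` and `df2` previews and inspects them with `.shape`, `.columns` and `.dtypes`. The names `wide_df` and `long_df` are hypothetical, and the values are toy data modelled on the previews, not the contents of the provided CSV files.

```python
import pandas as pd

# Toy stand-in for df1: one column per recipe (wide layout).
wide_df = pd.DataFrame({
    "Grocery Item": ["Eggs", "Milk", "Flour"],
    "Cake Recipe": [2, 1, 2],
    "Pancake Recipe": [3, 2, 1],
})

# Toy stand-in for df2: one row per (item, recipe) observation (long layout).
long_df = pd.DataFrame({
    "Grocery Item": ["Eggs", "Milk", "Flour", "Eggs", "Milk", "Flour"],
    "Recipe": ["Cake Recipe"] * 3 + ["Pancake Recipe"] * 3,
    "Number": [2, 1, 2, 3, 2, 1],
})

# Compare the two layouts with the usual inspection attributes.
for name, frame in [("wide", wide_df), ("long", long_df)]:
    print(name, frame.shape)
    print(frame.columns.tolist())
    print(frame.dtypes)
```

In the long layout every variable (item, recipe, amount) has its own column, which is why that shape is the one treated as ready for analysis here.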

## 3. Dealing with Multiple Files

**Task 1**  
- We have 10 different files containing 100 students each. 
- These files follow the naming structure:
    - `exams0.csv`
    - `exams1.csv`
    - … up to `exams9.csv`
- We are going to import each file using pandas, and combine all of the entries into one DataFrame.
- First, create a variable called `student_files` and set it equal to the `glob()` of all of the `csv` files we want to import.

<br>

**Task 2**  
- Create an empty list called `df_list` that will store all of the DataFrames we make from the files `exams0.csv` through `exams9.csv`.

<br>

**Task 3**  
- Loop through the filenames in `student_files`, and create a DataFrame from each file. 
- Append this DataFrame to `df_list`.

<br>

**Task 4**  
- Concatenate all of the DataFrames in `df_list` into one DataFrame called `students`.

<br>

**Task 5**  
- Print `students` and the length of `students`. 
- Did we get all of them?

In [9]:
# Task 1
student_files = glob.glob("exams*.csv")

# Task 2
df_list = []

# Task 3
for filename in student_files:
    data = pd.read_csv(filename)
    df_list.append(data)

# Task 4
students = pd.concat(df_list)

# Task 5
students.head(), len(students)

(   id        full_name gender_age fractions probability       grade
 0   0   Barrett Feragh        M14       76%         72%   9th grade
 1   1  Llewellyn Keech        M14       83%         NaN  12th grade
 2   2  Llewellyn Keech        M14       83%         NaN  12th grade
 3   3     Terrell Geri        M15       80%         86%  11th grade
 4   4   Gram Hallewell        M14       67%         78%  10th grade,
 1000)
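
The glob-and-concatenate pattern can be tried without the exam files. The sketch below writes three tiny hypothetical CSVs to a temporary directory, reads them back, and combines them. The file contents, the `toy_files` and `combined` names, and the use of `sorted()` and `ignore_index=True` are additions for illustration and are not part of the exercise solution.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write three tiny hypothetical files: exams0.csv, exams1.csv, exams2.csv.
    for i in range(3):
        toy = pd.DataFrame({
            "id": [0, 1],
            "full_name": [f"Student {i}A", f"Student {i}B"],
        })
        toy.to_csv(os.path.join(tmp_dir, f"exams{i}.csv"), index=False)

    # glob does not guarantee an order, so sort for a reproducible result.
    toy_files = sorted(glob.glob(os.path.join(tmp_dir, "exams*.csv")))
    toy_frames = [pd.read_csv(path) for path in toy_files]

# ignore_index=True renumbers the rows instead of repeating each file's index.
combined = pd.concat(toy_frames, ignore_index=True)
print(len(combined))
print(combined.index.tolist())
```

Without `ignore_index=True`, each file's row labels are kept as they are, so the combined index would contain repeated labels.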

## 4. Reshaping Your Data

**Task 1**  
- Print out the columns of `students`.

<br>

**Task 2**  
- There is a column for the scores on the `fractions` exam, and a column for the scores on the `probability` exam.
- We want to make each row an observation, so we want to transform this table to look like:
    | *full_name* | *exam* | *score* | *gender_age* | *grade* |
    |-------------|--------|---------|--------------|---------|
    | “First Student” | “Fractions” | score% | ... | ... |
    | “First Student” | “Probabilities” | score% | ... | ... |
    | “Second Student” | “Fractions” | score% | ... | ... |
    | “Second Student” | “Probabilities” | score% | ... | ... |
    | ... | ... | ... | ... | ... |
- Use `pd.melt()` to create a new table (still called `students`) that follows this structure.

<br>

**Task 3**  
- Print the `.head()` and the `.columns` of `students`.
- Also, print out the `.value_counts()` of the `exam` column.

In [10]:
students = pd.read_csv("students.csv")
students = students[['full_name', 'gender_age','fractions','probability','grade']]

In [11]:
# Task 1
students

Unnamed: 0,full_name,gender_age,fractions,probability,grade
0,Moses Kirckman,M14,69%,89%,11th grade
1,Timofei Strowan,M18,63%,76%,11th grade
2,Silvain Poll,M18,69%,77%,9th grade
3,Lezley Pinxton,M18,,72%,11th grade
4,Bernadene Saunper,F17,72%,84%,11th grade
...,...,...,...,...,...
995,Wilie Stillert,F14,72%,69%,9th grade
996,Gertie Flicker,F15,,86%,11th grade
997,Yettie Labes,F14,81%,82%,12th grade
998,Lock McGuinley,M18,,84%,10th grade


In [16]:
# Task 2
students = pd.melt(frame=students, id_vars=["full_name", "gender_age", "grade"], value_vars=["fractions", "probability"], value_name="score", var_name="exam")

In [18]:
# Task 3
students.head()

Unnamed: 0,full_name,gender_age,grade,exam,score
0,Moses Kirckman,M14,11th grade,fractions,69%
1,Timofei Strowan,M18,11th grade,fractions,63%
2,Silvain Poll,M18,9th grade,fractions,69%
3,Lezley Pinxton,M18,11th grade,fractions,
4,Bernadene Saunper,F17,11th grade,fractions,72%


In [20]:
students.columns, students.value_counts("exam")

(Index(['full_name', 'gender_age', 'grade', 'exam', 'score'], dtype='object'),
 exam
 fractions      1000
 probability    1000
 Name: count, dtype: int64)
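
To see what `pd.melt()` does on a frame small enough to read at a glance, here is a minimal sketch with two hypothetical students. The `toy` and `toy_long` names and the scores are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "full_name": ["First Student", "Second Student"],
    "fractions": ["69%", "63%"],
    "probability": ["89%", "76%"],
})

# Each (student, exam) pair becomes its own row.
toy_long = pd.melt(
    frame=toy,
    id_vars=["full_name"],
    value_vars=["fractions", "probability"],
    var_name="exam",
    value_name="score",
)
print(toy_long)
```

The two input rows become four output rows: the old column names move into `exam`, and the cell values move into `score`.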

## 5. Dealing with Duplicates

**Task 1**  
- It seems like in the data collection process, some rows may have been recorded twice. 
- Use the `.duplicated()` function on the `students` DataFrame to make a Series object called `duplicates`.

<br>

**Task 2**  
- Print out the `.value_counts()` of the `duplicates` Series to see how many rows are exact duplicates.

<br>

**Task 3**  
- Update the value of `students` to be the `students` table with the duplicates dropped.

<br>

**Task 4**  
- Use the `.duplicated()` function again to make a Series object called `duplicates` after dropping the duplicates. 
- Print out the value counts again. 
- Are there any `True`s left?

In [27]:
students = pd.read_csv("students.csv")
students = students[['full_name', 'gender_age','fractions','probability','grade']]
students = pd.melt(frame=students, id_vars=['full_name','gender_age','grade'], value_vars=['fractions', 'probability'], value_name='score', var_name='exam')

students

Unnamed: 0,full_name,gender_age,grade,exam,score
0,Moses Kirckman,M14,11th grade,fractions,69%
1,Timofei Strowan,M18,11th grade,fractions,63%
2,Silvain Poll,M18,9th grade,fractions,69%
3,Lezley Pinxton,M18,11th grade,fractions,
4,Bernadene Saunper,F17,11th grade,fractions,72%
...,...,...,...,...,...
1995,Wilie Stillert,F14,9th grade,probability,69%
1996,Gertie Flicker,F15,11th grade,probability,86%
1997,Yettie Labes,F14,12th grade,probability,82%
1998,Lock McGuinley,M18,10th grade,probability,84%


In [28]:
# Task 1
duplicates = students.duplicated()

# Task 2
print(duplicates.value_counts())

# Task 3
students = students.drop_duplicates()

# Task 4
duplicates = students.duplicated()
print(duplicates.value_counts())

False    1976
True       24
Name: count, dtype: int64
False    1976
Name: count, dtype: int64
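
The behaviour of `.duplicated()` and `.drop_duplicates()` is easier to follow on a handful of rows. The sketch below uses made-up data; the `subset=` call at the end is an extra illustration of a pandas option and is not used in the exercise.

```python
import pandas as pd

toy = pd.DataFrame({
    "full_name": ["Ada A", "Ada A", "Ben B", "Ben B"],
    "exam": ["fractions", "fractions", "fractions", "probability"],
    "score": ["70%", "70%", "80%", "85%"],
})

# Only the second "Ada A" row repeats an earlier row in every column.
flags = toy.duplicated()
print(flags.tolist())

deduped = toy.drop_duplicates()
print(len(deduped))

# subset= restricts the comparison to the listed columns.
by_name = toy.drop_duplicates(subset=["full_name"])
print(len(by_name))
```

By default a row counts as a duplicate only when every column matches an earlier row, so the two "Ben B" rows are both kept.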


## 6. Splitting by Index

**Task 1**  
- Print out the columns of the `students` DataFrame.

<br>

**Task 2**  
- The column `gender_age` sounds like it contains both `gender` and `age`!
- Print out the `.head()` of the column to see what kind of data it contains.

<br>

**Task 3**  
- It looks like the first character of the values in `gender_age` contains the gender, while the rest of the string contains the age. 
- Let’s separate out the gender data into a new column called `gender`.

<br>

**Task 4**  
- Now, separate out the age data into a new column called `age`.

<br>

**Task 5**  
- Let’s print the `.head()` of `students` to see how the DataFrame looks after our creation of new columns.

<br>

**Task 6**  
- Now, we don’t need that `gender_age` column anymore.
- Let’s set the `students` DataFrame to be the `students` DataFrame with all columns except `gender_age`.

In [33]:
students = pd.read_csv("students.csv")
students = students[['full_name', 'gender_age','fractions','probability','grade']]
students = pd.melt(frame=students, id_vars=['full_name','gender_age','grade'], value_vars=['fractions', 'probability'], value_name='score', var_name='exam')
students = students.drop_duplicates()

students

Unnamed: 0,full_name,gender_age,grade,exam,score
0,Moses Kirckman,M14,11th grade,fractions,69%
1,Timofei Strowan,M18,11th grade,fractions,63%
2,Silvain Poll,M18,9th grade,fractions,69%
3,Lezley Pinxton,M18,11th grade,fractions,
4,Bernadene Saunper,F17,11th grade,fractions,72%
...,...,...,...,...,...
1995,Wilie Stillert,F14,9th grade,probability,69%
1996,Gertie Flicker,F15,11th grade,probability,86%
1997,Yettie Labes,F14,12th grade,probability,82%
1998,Lock McGuinley,M18,10th grade,probability,84%


In [34]:
# Task 1
print(students.columns)

# Task 2
print(students.head())

# Task 3
students["gender"] = students.gender_age.str[:1]

# Task 4
students["age"] = students.gender_age.str[1:]

# Task 5
print(students.head())

# Task 6
students = students.drop(columns="gender_age")

students.head()

Index(['full_name', 'gender_age', 'grade', 'exam', 'score'], dtype='object')
           full_name gender_age       grade       exam score
0     Moses Kirckman        M14  11th grade  fractions   69%
1    Timofei Strowan        M18  11th grade  fractions   63%
2       Silvain Poll        M18   9th grade  fractions   69%
3     Lezley Pinxton        M18  11th grade  fractions   NaN
4  Bernadene Saunper        F17  11th grade  fractions   72%
           full_name gender_age       grade       exam score gender age
0     Moses Kirckman        M14  11th grade  fractions   69%      M  14
1    Timofei Strowan        M18  11th grade  fractions   63%      M  18
2       Silvain Poll        M18   9th grade  fractions   69%      M  18
3     Lezley Pinxton        M18  11th grade  fractions   NaN      M  18
4  Bernadene Saunper        F17  11th grade  fractions   72%      F  17


Unnamed: 0,full_name,grade,exam,score,gender,age
0,Moses Kirckman,11th grade,fractions,69%,M,14
1,Timofei Strowan,11th grade,fractions,63%,M,18
2,Silvain Poll,9th grade,fractions,69%,M,18
3,Lezley Pinxton,11th grade,fractions,,M,18
4,Bernadene Saunper,11th grade,fractions,72%,F,17
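
Here is a minimal sketch of the same index-based split on three made-up `gender_age` values. Converting `age` with `pd.to_numeric()` is an extra step added here as an assumption about what you might want next; the exercise itself leaves `age` as a string.

```python
import pandas as pd

toy = pd.DataFrame({"gender_age": ["M14", "F17", "M18"]})

# First character is the gender; everything after it is the age.
toy["gender"] = toy["gender_age"].str[0]
toy["age"] = pd.to_numeric(toy["gender_age"].str[1:])

# The combined column is no longer needed.
toy = toy.drop(columns="gender_age")
print(toy)
```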


## 7. Splitting by Character

**Task 1**  
- The students’ names are stored in a column called `full_name`.
- We want to separate this data out into two new columns, `first_name` and `last_name`.
- First, let’s create a Series object called `name_split` that splits the `full_name` by the `" "` character.

<br>

**Task 2**  
- Now, let’s create a column called `first_name` that takes the first item in `name_split`.

<br>

**Task 3**  
- Finally, let’s create a column called `last_name` that takes the second item in `name_split`.

<br>

**Task 4**  
- Print out the `.head()` of `students` to see how the DataFrame has changed.

In [35]:
students = pd.read_csv("students.csv")
students = students[['full_name', 'gender_age','fractions','probability','grade']]
students = pd.melt(frame=students, id_vars=['full_name','gender_age','grade'], value_vars=['fractions', 'probability'], value_name='score', var_name='exam')
students = students.drop_duplicates()
students['gender'] = students.gender_age.str[0]
students['age'] = students.gender_age.str[1:]
students = students[['full_name','exam','score','gender','age']]

students

Unnamed: 0,full_name,exam,score,gender,age
0,Moses Kirckman,fractions,69%,M,14
1,Timofei Strowan,fractions,63%,M,18
2,Silvain Poll,fractions,69%,M,18
3,Lezley Pinxton,fractions,,M,18
4,Bernadene Saunper,fractions,72%,F,17
...,...,...,...,...,...
1995,Wilie Stillert,probability,69%,F,14
1996,Gertie Flicker,probability,86%,F,15
1997,Yettie Labes,probability,82%,F,14
1998,Lock McGuinley,probability,84%,M,18


In [36]:
# Task 1
name_split = students.full_name.str.split(" ")

# Task 2
students["first_name"] = name_split.str.get(0)

# Task 3
students["last_name"] = name_split.str.get(1)

# Task 4
students.head()

Unnamed: 0,full_name,exam,score,gender,age,first_name,last_name
0,Moses Kirckman,fractions,69%,M,14,Moses,Kirckman
1,Timofei Strowan,fractions,63%,M,18,Timofei,Strowan
2,Silvain Poll,fractions,69%,M,18,Silvain,Poll
3,Lezley Pinxton,fractions,,M,18,Lezley,Pinxton
4,Bernadene Saunper,fractions,72%,F,17,Bernadene,Saunper
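
Splitting on `" "` and taking items 0 and 1 works when every name has exactly two parts. As a hedged variation, the sketch below uses `n=1` and `expand=True` so that anything after the first space stays in `last_name`. The name "Mary Ann Smith" is hypothetical and is included only to show a three-part name.

```python
import pandas as pd

toy = pd.DataFrame({"full_name": ["Moses Kirckman", "Mary Ann Smith"]})

# n=1 splits at the first space only; expand=True returns two columns.
parts = toy["full_name"].str.split(" ", n=1, expand=True)
toy["first_name"] = parts[0]
toy["last_name"] = parts[1]
print(toy)
```

Whether "Ann" belongs to the first or the last name depends on the data, so treat this as one possible convention rather than a rule.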


## 8. Looking at Types

**Task 1**  
- Let’s inspect the dtypes in the `students` table.
- Print out the `.dtypes` attribute.

<br>

**Task 2**  
- If we wanted to make a scatterplot of `age` vs average exam score, would we be able to do it with this type of data?
- Try to print out the mean of the `score` column of `students`.

In [37]:
students = pd.read_csv("students.csv")
students = students[['full_name', 'gender_age','fractions','probability','grade']]
students = pd.melt(frame=students, id_vars=['full_name','gender_age','grade'], value_vars=['fractions', 'probability'], value_name='score', var_name='exam')
students = students.drop_duplicates()
students['gender'] = students.gender_age.str[0]
students['age'] = students.gender_age.str[1:]
students = students[['full_name','exam','score','gender','age']]
name_split = students['full_name'].str.split(" ")
students['first_name'] = name_split.str.get(0)
students['last_name'] = name_split.str.get(1)

students.head()

Unnamed: 0,full_name,exam,score,gender,age,first_name,last_name
0,Moses Kirckman,fractions,69%,M,14,Moses,Kirckman
1,Timofei Strowan,fractions,63%,M,18,Timofei,Strowan
2,Silvain Poll,fractions,69%,M,18,Silvain,Poll
3,Lezley Pinxton,fractions,,M,18,Lezley,Pinxton
4,Bernadene Saunper,fractions,72%,F,17,Bernadene,Saunper


In [38]:
students.dtypes

full_name     object
exam          object
score         object
gender        object
age           object
first_name    object
last_name     object
dtype: object

In [None]:
students.score.mean()   # expected to raise an error, because `score` holds strings such as "69%" rather than numbers
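
To reproduce the failure safely, this small sketch calls `.mean()` on a Series of percentage strings inside a `try` block. The `toy_scores` values are made up, and the exact exception message depends on the pandas version, so none is shown.

```python
import pandas as pd

toy_scores = pd.Series(["69%", "63%", "72%"])

try:
    toy_scores.mean()
    mean_failed = False
except (TypeError, ValueError) as err:
    # Strings cannot be averaged; the message varies by pandas version.
    mean_failed = True
    print(type(err).__name__)

print(mean_failed)
```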

## 9. String Parsing

**Task 1**  
- We saw in the last exercise that we cannot take the mean of the `score` column while its values are stored as strings (dtype `object`) rather than numbers.
- Use regex to take out the % signs in the `score` column.

<br>

**Task 2**  
- Convert the `score` column to a numerical type using the `pd.to_numeric()` function.

In [42]:
students = pd.read_csv("students.csv")
students = students[['full_name', 'gender_age','fractions','probability','grade']]
students = pd.melt(frame=students, id_vars=['full_name','gender_age','grade'], value_vars=['fractions', 'probability'], value_name='score', var_name='exam')
students = students.drop_duplicates()
students['gender'] = students.gender_age.str[0]
students['age'] = students.gender_age.str[1:]
students = students[['full_name','exam','score','gender','age']]
name_split = students['full_name'].str.split(" ")
students['first_name'] = name_split.str.get(0)
students['last_name'] = name_split.str.get(1)

In [43]:
# Task 1
# regex=True makes pandas treat the pattern as a regular expression, as the task asks
students["score"] = students["score"].str.replace(r"%", "", regex=True)

# Task 2
students["score"] = pd.to_numeric(students["score"])

students.dtypes

full_name      object
exam           object
score         float64
gender         object
age            object
first_name     object
last_name      object
dtype: object
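
Here is a minimal sketch of the two steps on made-up scores, including one missing value. The character class `[%,]` also strips commas; that mirrors the pattern used in later cells of this notebook and is not needed for these particular values.

```python
import pandas as pd

toy = pd.DataFrame({"score": ["69%", "63%", None]})

# Raw-string pattern: a character class matching "%" or ",".
toy["score"] = toy["score"].str.replace(r"[%,]", "", regex=True)

# Missing entries stay missing after the conversion.
toy["score"] = pd.to_numeric(toy["score"])
print(toy["score"].tolist())
print(toy["score"].dtype)
```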

## 10. More String Parsing

**Task 1**  
- Print out the first five rows of the `grade` column.

<br>

**Task 2**  
- Each value in `grade` looks like “9th grade”, “10th grade”, “11th grade”, or “12th grade”.
- We want to pare that down to just having the numerical grade. 
- Maybe we want to do linear regression on this data, which would require numerical inputs.
- Use regex to extract the number from each string in `grade` and store those values back into the `grade` column.

<br>

**Task 3**  
- Print the dtypes of the `students` table.

<br>

**Task 4**  
- Convert the `grade` column to be numerical values instead of objects.

<br>

**Task 5**  
- Calculate the mean of `grade`, store it in a variable called `avg_grade`, and then print it out!
- We could not have done this with strings like “9th grade” or “10th grade”.

In [49]:
students = pd.read_csv("students.csv")
students = students[['full_name', 'gender_age','fractions','probability','grade']]
students = pd.melt(frame=students, id_vars=['full_name','gender_age','grade'], value_vars=['fractions', 'probability'], value_name='score', var_name='exam')
students = students.drop_duplicates()
students['gender'] = students.gender_age.str[0]
students['age'] = students.gender_age.str[1:]
students = students[['full_name','exam','score','gender','age','grade']]
name_split = students['full_name'].str.split(" ")
students['first_name'] = name_split.str.get(0)
students['last_name'] = name_split.str.get(1)
students.score = students['score'].replace(r'[%,]', '', regex=True)
students.score = pd.to_numeric(students['score'])

students

Unnamed: 0,full_name,exam,score,gender,age,grade,first_name,last_name
0,Moses Kirckman,fractions,69.0,M,14,11th grade,Moses,Kirckman
1,Timofei Strowan,fractions,63.0,M,18,11th grade,Timofei,Strowan
2,Silvain Poll,fractions,69.0,M,18,9th grade,Silvain,Poll
3,Lezley Pinxton,fractions,,M,18,11th grade,Lezley,Pinxton
4,Bernadene Saunper,fractions,72.0,F,17,11th grade,Bernadene,Saunper
...,...,...,...,...,...,...,...,...
1995,Wilie Stillert,probability,69.0,F,14,9th grade,Wilie,Stillert
1996,Gertie Flicker,probability,86.0,F,15,11th grade,Gertie,Flicker
1997,Yettie Labes,probability,82.0,F,14,12th grade,Yettie,Labes
1998,Lock McGuinley,probability,84.0,M,18,10th grade,Lock,McGuinley


In [50]:
# Task 1
print(students.grade.head())

# Task 2
split_df = students.grade.str.split(r"(\d+)", expand=True)  # raw string keeps \d intact
students.grade = split_df[1]

# Task 3
print(students.dtypes)

# Task 4
students.grade = pd.to_numeric(students.grade)

# Task 5
avg_grade = students.grade.mean()
print(avg_grade)

0    11th grade
1    11th grade
2     9th grade
3    11th grade
4    11th grade
Name: grade, dtype: object
full_name      object
exam           object
score         float64
gender         object
age            object
grade          object
first_name     object
last_name      object
dtype: object
10.620445344129555
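
The solution above pulls the number out by splitting on a capture group. An alternative sketch, assuming the same "Nth grade" format, uses `.str.extract()`, which returns the captured digits directly. The `toy` frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"grade": ["9th grade", "10th grade", "12th grade"]})

# expand=False returns a Series holding the first capture group.
digits = toy["grade"].str.extract(r"(\d+)", expand=False)
toy["grade"] = pd.to_numeric(digits)

print(toy["grade"].tolist())
print(toy["grade"].mean())
```

Both approaches should give the same numbers on well-formed values; `.str.extract()` just skips the intermediate DataFrame of split pieces.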


## 11. Missing Values

**Task 1**  
- Get the mean of the `score` column. Store it in `score_mean` and print it out.

<br>

**Task 2**  
- We will assume that everyone who doesn’t have a score for an exam missed the test. 
- We want to replace all `nan`s with a score of `0`. Let’s do this with the `score` column.
- Fill all of the `nan`s in `students['score']` with `0`.

<br>

**Task 3**  
- Get the mean of the `score` column again. 
- Store it in `score_mean_2` and print it out.

In [51]:
students = pd.read_csv("students.csv")
students = students[['full_name', 'gender_age','fractions','probability','grade']]
students = pd.melt(frame=students, id_vars=['full_name','gender_age','grade'], value_vars=['fractions', 'probability'], value_name='score', var_name='exam')
students = students.drop_duplicates()
students['gender'] = students.gender_age.str[0]
students['age'] = students.gender_age.str[1:]
students = students[['full_name','exam','score','gender','age','grade']]
name_split = students['full_name'].str.split(" ")
students['first_name'] = name_split.str.get(0)
students['last_name'] = name_split.str.get(1)
students.score = students['score'].replace(r'[%,]', '', regex=True)
students.score = pd.to_numeric(students['score'])
students.grade = students.grade.str.split(r'(\d+)', expand=True)[1]
students.grade = pd.to_numeric(students.grade)

students

Unnamed: 0,full_name,exam,score,gender,age,grade,first_name,last_name
0,Moses Kirckman,fractions,69.0,M,14,11,Moses,Kirckman
1,Timofei Strowan,fractions,63.0,M,18,11,Timofei,Strowan
2,Silvain Poll,fractions,69.0,M,18,9,Silvain,Poll
3,Lezley Pinxton,fractions,,M,18,11,Lezley,Pinxton
4,Bernadene Saunper,fractions,72.0,F,17,11,Bernadene,Saunper
...,...,...,...,...,...,...,...,...
1995,Wilie Stillert,probability,69.0,F,14,9,Wilie,Stillert
1996,Gertie Flicker,probability,86.0,F,15,11,Gertie,Flicker
1997,Yettie Labes,probability,82.0,F,14,12,Yettie,Labes
1998,Lock McGuinley,probability,84.0,M,18,10,Lock,McGuinley


In [52]:
# Task 1
score_mean = students.score.mean()
print(score_mean)

# Task 2
students["score"] = students["score"].fillna(0)

# Task 3
score_mean_2 = students.score.mean()
score_mean_2

77.69657422512235


np.float64(72.30971659919028)
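
The mean drops after `fillna(0)` because `.mean()` skips `NaN` by default, while the filled zeros are counted. This tiny sketch with made-up scores shows the same effect with numbers that are easy to check by hand.

```python
import numpy as np
import pandas as pd

toy_scores = pd.Series([80.0, np.nan, 70.0, 90.0])

# NaN is ignored: (80 + 70 + 90) / 3
mean_skipping_nan = toy_scores.mean()

# NaN is counted as 0: (80 + 0 + 70 + 90) / 4
mean_with_zeros = toy_scores.fillna(0).mean()

print(mean_skipping_nan, mean_with_zeros)
```

Filling with `0` encodes the assumption that a missing score means a missed test; a different assumption would call for a different fill value.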