# Reflective Writing for Data Science Career Path - Codecademy
by [Charalampos Spanias](https://cspanias.github.io/aboutme/) - February 2021

## Content
1. Getting Started with Data Science
2. Python Fundamentals
3. Data Acquisition
4. Data Manipulation with Pandas
5. [Data Wrangling \& Tidying](#wrangling)
    1. Fundamentals
    1. Regular Expressions
    1. [Data Cleaning](#cleaning)
        1. [Diagnose Data](#diagnose)
        1. [Multiple Files](#multiplefiles)
        1. [Reshaping Data](#reshapingdata)
        1. [Duplicates](#duplicates)
        1. [Index Splitting](#splitindex)
        1. [Character Splitting](#splitchar)
        1. [Looking at Types](#dtypes)
        1. [String Parsing](#stringparsing)
        1. [More String Parsing](#stringparsing+)
        1. [Missing Values](#nans)

<a name="wrangling"></a>
# 5. Data Wrangling & Tidying

<a name="cleaning"></a>
# 5.4 Data Cleaning

<a name="diagnose"></a>
## 5.4.2 Diagnose Data

We often describe __data that is easy to analyze and visualize as “tidy data”__. For data to be tidy, it must have:

1. Each variable as a separate column
1. Each row as a separate observation

The first step of diagnosing whether or not a dataset is tidy is using `pandas` functions to explore and probe the dataset.

Some of the most useful ones are:

* `.head()` — display the first 5 rows of the table
* `.info()` — display a summary of the table
* `.describe()` — display the summary statistics of the table
* `.columns` — display the column names of the table
* `.value_counts()` — display the distinct values of a column and how often each one occurs
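
As a rough illustration of the functions listed above, the following sketch runs each of them on a tiny table. The `groceries` DataFrame and its values are made up for this example and are not one of the course files.

```python
import pandas as pd

# Hypothetical grocery data, made up for illustration
groceries = pd.DataFrame({
    "item": ["Eggs", "Milk", "Flour", "Eggs"],
    "quantity": [2, 1, 2, 3],
})

print(groceries.head(2))        # first 2 rows instead of the default 5
groceries.info()                # column names, non-null counts and dtypes
print(groceries.describe())     # summary statistics of the numeric columns
print(list(groceries.columns))  # the column names

# value_counts() works on a single column (a Series)
item_counts = groceries["item"].value_counts()
print(item_counts)              # "Eggs" occurs twice, the others once
```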

In [1]:
import pandas as pd

df1 = pd.read_csv('df1.csv')
df1.head()

Unnamed: 0,Grocery Item,Cake Recipe,Pancake Recipe,Cookie Recipe
0,Eggs,2,3,1
1,Milk,1,2,1
2,Flour,2,1,2


In [2]:
df2 = pd.read_csv('df2.csv')
df2.head()

Unnamed: 0,Grocery Item,Recipe,Number
0,Eggs,Cake Recipe,2
1,Milk,Cake Recipe,1
2,Flour,Cake Recipe,2
3,Eggs,Pancake Recipe,3
4,Milk,Pancake Recipe,2


<a name="multiplefiles"></a>
## 5.4.3 Multiple Files

Often, you have the __same data separated out into multiple files__.

Let’s say that we have a ton of files following the filename structure: `'file1.csv'`, `'file2.csv'`, `'file3.csv'`, and so on. 

The power of pandas is mainly in being able to __manipulate large amounts of structured data__, so we want to be able to get all of the relevant information into one table so that we can analyze the aggregate data.

We can combine `glob`, which __finds filenames using shell-style wildcard matching__ (patterns such as `*` and `?`, not full regex), with `pandas`, which reads each file and combines the data into one table.

In [3]:
import glob

# return all file paths that match the specific pattern
files = glob.glob('exams*.csv')

df_list = []
# iterate through files
for filename in files:
    # read each file
    data = pd.read_csv(filename)
    # append file's data
    df_list.append(data)
    
# concatenate all elements    
df = pd.concat(df_list)

print(files)
df.head()

['exams0.csv', 'exams1.csv', 'exams2.csv']


Unnamed: 0,id,full_name,gender_age,fractions,probability,grade
0,0,Kizzee Kensington,F17,68%,76%,12th grade
1,1,Garret Hartfleet,M17,68%,90%,12th grade
2,2,Ingrid Meatyard,F18,65%,,12th grade
3,3,Verney Sainsberry,M18,85%,76%,9th grade
4,4,Lonni Bruhnicke,F16,71%,81%,9th grade
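
The same pattern can be tried without the course files. The sketch below first writes three small, made-up CSV files into a temporary folder (the file contents are assumptions for illustration only) and then combines them. Passing `ignore_index=True` to `pd.concat()` is one possible choice here: it gives the combined table a fresh index instead of repeating each file's own row numbers.

```python
import glob
import os
import tempfile

import pandas as pd

# Write three small hypothetical CSV files into a temporary folder
tmp_dir = tempfile.mkdtemp()
for i in range(3):
    pd.DataFrame({"id": [i], "score": [60 + i]}).to_csv(
        os.path.join(tmp_dir, f"exams{i}.csv"), index=False
    )

# '*' is a shell-style wildcard; sorted() makes the file order predictable
files = sorted(glob.glob(os.path.join(tmp_dir, "exams*.csv")))

# read every file and stack the tables on top of each other
combined = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
print(combined)
```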


<a name="reshapingdata"></a>
## 5.4.4 Reshaping Data

Since we want each variable as a separate column and each row as a separate observation, we can use `pd.melt()` to do this transformation. `pd.melt()` takes in a DataFrame and the columns to unpack.

The parameters you provide are:

1. `frame`: the DataFrame you want to melt
1. `id_vars`: the column(s) of the old DataFrame to preserve
1. `value_vars`: the column(s) of the old DataFrame that you want to turn into variables
1. `value_name`: what to call the column of the new DataFrame that stores the values
1. `var_name`: what to call the column of the new DataFrame that stores the variables

The default names may work in certain situations, but it’s best to always have data that is self-explanatory. Thus, we often check the `.columns` attribute and, if needed, assign a new list of names to it to rename the columns after melting:

In [5]:
students = pd.read_csv("students.csv")

print(students.columns)

df

Index(['Unnamed: 0', 'full_name', 'gender_age', 'fractions', 'probability',
       'grade'],
      dtype='object')


Unnamed: 0,id,full_name,gender_age,fractions,probability,grade
0,0,Kizzee Kensington,F17,68%,76%,12th grade
1,1,Garret Hartfleet,M17,68%,90%,12th grade
2,2,Ingrid Meatyard,F18,65%,,12th grade
3,3,Verney Sainsberry,M18,85%,76%,9th grade
4,4,Lonni Bruhnicke,F16,71%,81%,9th grade
...,...,...,...,...,...,...
95,95,Massimo Babbs,M16,84%,83%,11th grade
96,96,Garrek Manuely,M14,70%,85%,9th grade
97,97,Winslow Lilliman,M15,81%,83%,11th grade
98,98,Grantham Yate,M15,77%,87%,10th grade


We want to make **each row an observation**.

In [6]:
students = pd.melt(frame=students, id_vars=['full_name', 'gender_age', 'grade'],
       value_vars=['fractions', 'probability'], value_name='score', var_name='exam')

print(students.columns, "\n")
print(students.exam.value_counts())
students.head()

Index(['full_name', 'gender_age', 'grade', 'exam', 'score'], dtype='object') 

fractions      1000
probability    1000
Name: exam, dtype: int64


Unnamed: 0,full_name,gender_age,grade,exam,score
0,Moses Kirckman,M14,11th grade,fractions,69%
1,Timofei Strowan,M18,11th grade,fractions,63%
2,Silvain Poll,M18,9th grade,fractions,69%
3,Lezley Pinxton,M18,11th grade,fractions,
4,Bernadene Saunper,F17,11th grade,fractions,72%


<a name="duplicates"></a>
## 5.4.5 Duplicates

To check for duplicates, we can use the pandas function `.duplicated()`, which will return a Series telling us which rows are duplicate rows.

We can use the pandas `.drop_duplicates()` function to remove all rows that are exactly the same as another row. Rows that differ in any column are kept: for example, two "peach" rows would both remain if there is a difference in the price column.

If we wanted to remove every row with a duplicate value in the item column, we could specify a __subset__.

`df = df.drop_duplicates(subset=['col_name'])` By default, this __keeps the first occurrence of the duplicate__.


__Make sure that the columns you drop duplicates from are specifically the ones where duplicates don’t belong__. 

You wouldn’t want to drop duplicates with the price column as a subset, for example, because __it’s okay if multiple items cost the same amount__.

In [25]:
duplicates = students.duplicated()

# check the number of duplicates
print(duplicates.value_counts(), "\n")

students = students.drop_duplicates()

duplicates = students.duplicated()
print(duplicates.value_counts())

False    1976
dtype: int64 

False    1976
dtype: int64
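
The difference between dropping exact duplicates and dropping by a subset can be sketched on a small fruit table. The rows below are made up for illustration and include one exact duplicate ("apple") and two "peach" rows that differ only in price.

```python
import pandas as pd

fruits = pd.DataFrame({
    "item": ["banana", "apple", "apple", "peach", "peach"],
    "price": ["$1", "$0.75", "$0.75", "$3", "$4"],
    "calories": [105, 95, 95, 55, 55],
})

# True only where a row repeats an earlier row in every column
print(fruits.duplicated())

# removes the second "apple" row, keeps both "peach" rows
exact = fruits.drop_duplicates()

# keeps only the first row for each distinct item
by_item = fruits.drop_duplicates(subset=["item"])

print(len(exact), len(by_item))
```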


<a name="splitindex"></a>
## 5.4.6 Index Splitting

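In trying to get clean data, we want to __make sure each column represents one type of measurement__. When several values sit at fixed positions inside one string, such as the day, month and year of a date, we can separate them by slicing the string at known indices.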

In [8]:
dates = ["18/02/2022", "19/02/2022", "20/02/2022"]

for date in dates:
    day = date[:2]
    month = date[3:5]
    year = date[-4:]
    
print(f"The day is: {day}.")
print(f"The month is: {month}.")
print(f"The year is: {year}.")

The day is: 20.
The month is: 02.
The year is: 2022.
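
The same slicing can be applied to a whole DataFrame column through the `.str` accessor. This is a minimal sketch that assumes every date has the fixed `DD/MM/YYYY` layout used above; the `dates_df` table is made up for illustration.

```python
import pandas as pd

dates_df = pd.DataFrame({"date": ["18/02/2022", "19/02/2022", "20/02/2022"]})

# .str[...] slices every string in the column at the same positions
dates_df["day"] = dates_df["date"].str[:2]
dates_df["month"] = dates_df["date"].str[3:5]
dates_df["year"] = dates_df["date"].str[-4:]
print(dates_df)
```

The new columns still hold strings; `pd.to_numeric()` would be needed to treat them as numbers.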


<a name="splitchar"></a>
## 5.4.7 Character Splitting

When we have a column which contains two types of data, it might __not be possible to split them using slicing__ as above.

Let’s say we have a column called “type” with data entries in the format "admin_US" or "user_Kenya". Just like we saw before, this column actually contains two types of data. One seems to be the user type (with values like “admin” or “user”) and one seems to be the country this user is in (with values like “US” or “Kenya”).

We can no longer just split along the first 4 characters because admin and user are of different lengths. Instead, we know that we want to split along the "_". Using that, we can split this column into two separate, cleaner columns:

In [16]:
df = pd.DataFrame(["user_Kenya", "admin_US", "moderator_GR"], columns=['role_country'])
df

Unnamed: 0,role_country
0,user_Kenya
1,admin_US
2,moderator_GR


In [17]:
# Create the 'str_split' column
df['str_split'] = df.role_country.str.split('_')
df

Unnamed: 0,role_country,str_split
0,user_Kenya,"[user, Kenya]"
1,admin_US,"[admin, US]"
2,moderator_GR,"[moderator, GR]"


In [18]:
# Create the 'usertype' column
df['usertype'] = df.str_split.str.get(0)
df

Unnamed: 0,role_country,str_split,usertype
0,user_Kenya,"[user, Kenya]",user
1,admin_US,"[admin, US]",admin
2,moderator_GR,"[moderator, GR]",moderator


In [19]:
# Create the 'country' column
df['country'] = df.str_split.str.get(1)
df

Unnamed: 0,role_country,str_split,usertype,country
0,user_Kenya,"[user, Kenya]",user,Kenya
1,admin_US,"[admin, US]",admin,US
2,moderator_GR,"[moderator, GR]",moderator,GR
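
As an alternative to the intermediate `str_split` column, `str.split()` can return the pieces as separate columns directly when `expand=True` is passed. The sketch below assumes every entry contains exactly one `_`; otherwise the number of resulting columns would not match the two target names.

```python
import pandas as pd

roles = pd.DataFrame({"role_country": ["user_Kenya", "admin_US", "moderator_GR"]})

# expand=True returns a DataFrame with one column per piece
roles[["usertype", "country"]] = roles["role_country"].str.split("_", expand=True)
print(roles)
```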


<a name="dtypes"></a>
## 5.4.8 Looking at Types

Each pandas __Series holds items of a single data type__ (columns with text or mixed values typically show up as `object`). The `dtypes` that pandas uses are:
1. float
1. int
1. bool
1. datetime
1. timedelta
1. category
1. object

In [24]:
students.dtypes

full_name     object
gender_age    object
grade         object
exam          object
score         object
dtype: object
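
When a column has a less useful type than its contents deserve, `.astype()` is one way to convert it. The table below is made up for illustration; it stores ages as text and then converts them to integers, and turns a repetitive text column into a `category`.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "age": ["17", "18"],                    # numbers stored as text
    "grade": ["12th grade", "9th grade"],
})
print(people.dtypes)

people["age"] = people["age"].astype(int)             # text -> integer
people["grade"] = people["grade"].astype("category")  # text -> category
print(people.dtypes)
```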

<a name="stringparsing"></a>
## 5.4.9 String Parsing

Sometimes we need to modify strings in our DataFrames to help us __transform them into more meaningful metrics__.

In [36]:
fruits = pd.DataFrame({
                "item": ["banana", "apple", "peach", "peach"],
                "price": ["$1", "$0.75", "$3", "$4"],
                "calories": [105, 95, 55, 55]
})
print(fruits.dtypes)
fruits

item        object
price       object
calories     int64
dtype: object


Unnamed: 0,item,price,calories
0,banana,$1,105
1,apple,$0.75,95
2,peach,$3,55
3,peach,$4,55


In [37]:
fruits['price'] = fruits['price'].replace(r"[\$]", "", regex=True)
fruits

Unnamed: 0,item,price,calories
0,banana,1.0,105
1,apple,0.75,95
2,peach,3.0,55
3,peach,4.0,55


In [38]:
fruits.price = pd.to_numeric(fruits.price)
print(fruits.dtypes)
fruits

item         object
price       float64
calories      int64
dtype: object


Unnamed: 0,item,price,calories
0,banana,1.0,105
1,apple,0.75,95
2,peach,3.0,55
3,peach,4.0,55
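
Because the dollar sign is a fixed character rather than a pattern, a plain (non-regex) replacement is another option. The following sketch uses `Series.str.replace()` with `regex=False` on a standalone Series holding the same prices; it is an alternative to the cells above, not the approach they use.

```python
import pandas as pd

prices = pd.Series(["$1", "$0.75", "$3", "$4"])

# regex=False treats "$" as a literal character, so no escaping is needed
numeric_prices = pd.to_numeric(prices.str.replace("$", "", regex=False))
print(numeric_prices)
```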


<a name="stringparsing+"></a>
## 5.4.10 More String Parsing

Sometimes we want to do analysis on __numbers that are hidden within string values__. 

We can __use regex to extract this numerical data__ from the strings they are trapped in. 

In [39]:
df = pd.DataFrame({
    "date": ["10/18/2018", "10/18/2018", "10/18/2018"],
    "exercise_reps": ["lungers - 30 reps", "squats - 20 reps", "deadlifts - 25 reps"]
})
df

Unnamed: 0,date,exercise_reps
0,10/18/2018,lungers - 30 reps
1,10/18/2018,squats - 20 reps
2,10/18/2018,deadlifts - 25 reps


In [40]:
split_df = df['exercise_reps'].str.split(r'(\d+)', expand=True)
split_df

Unnamed: 0,0,1,2
0,lungers -,30,reps
1,squats -,20,reps
2,deadlifts -,25,reps


In [42]:
df['reps'] = pd.to_numeric(split_df[1])
df['exercise'] = split_df[0].replace(r'[\- ]', '', regex=True)
df

Unnamed: 0,date,exercise_reps,reps,exercise
0,10/18/2018,lungers - 30 reps,30,lungers
1,10/18/2018,squats - 20 reps,20,squats
2,10/18/2018,deadlifts - 25 reps,25,deadlifts
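
Another way to pull the pieces out is `Series.str.extract()`, which returns one column per capture group. The sketch below assumes every entry follows the `"<name> - <number> reps"` layout shown above; the group names `exercise` and `reps` are chosen here for illustration.

```python
import pandas as pd

workouts = pd.DataFrame({
    "exercise_reps": ["lungers - 30 reps", "squats - 20 reps", "deadlifts - 25 reps"]
})

# named groups become the column names of the result
parts = workouts["exercise_reps"].str.extract(r"(?P<exercise>\w+) - (?P<reps>\d+) reps")

# the extracted digits are still text, so convert them
parts["reps"] = pd.to_numeric(parts["reps"])
print(parts)
```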


<a name="nans"></a>
## 5.4.11 Missing Values

We often have data with missing elements, as a result of a __problem with the data collection process or errors in the way the data was stored__. The missing elements normally show up as `NaN` (Not a Number) values.

Some calculations we do will just __skip the `NaN` values__ but some calculations or visualizations we try to perform will __break when a NaN is encountered__.

Most of the time, we use one of two methods to deal with missing values:
1. `df.dropna()` drops all rows that contain a `NaN`; `df.dropna(subset=['col_name'])` drops only the rows with a `NaN` in the listed columns
2. `df.fillna()` fills the `NaN`s with a given value, often an aggregate such as the column mean: `df.fillna(value={"col_A": df.col_A.mean(), "col_B": df.col_B.mean()})`
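
Both methods can be compared on a small table. The `scores` DataFrame below is made up for illustration and has one missing value in each exam column.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "fractions": [69.0, np.nan, 72.0, 63.0],
    "probability": [80.0, 90.0, np.nan, 70.0],
})

# drop every row that has a NaN anywhere
dropped_any = scores.dropna()

# drop only the rows where 'fractions' is missing
dropped_subset = scores.dropna(subset=["fractions"])

# fill each column's NaN with that column's mean (NaNs are skipped by mean())
filled = scores.fillna(value={
    "fractions": scores["fractions"].mean(),
    "probability": scores["probability"].mean(),
})
print(filled)
```

None of these calls changes `scores` itself; each returns a new DataFrame.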

In [46]:
students

Unnamed: 0,full_name,gender_age,grade,exam,score
0,Moses Kirckman,M14,11th grade,fractions,69%
1,Timofei Strowan,M18,11th grade,fractions,63%
2,Silvain Poll,M18,9th grade,fractions,69%
3,Lezley Pinxton,M18,11th grade,fractions,
4,Bernadene Saunper,F17,11th grade,fractions,72%
...,...,...,...,...,...
1995,Wilie Stillert,F14,9th grade,probability,69%
1996,Gertie Flicker,F15,11th grade,probability,86%
1997,Yettie Labes,F14,12th grade,probability,82%
1998,Lock McGuinley,M18,10th grade,probability,84%


In [53]:
# strip the '%' sign, then convert the remaining text to numbers
students['score'] = pd.to_numeric(students['score'].replace('%', '', regex=True))
print(students.dtypes)
students.score.head(10)

full_name      object
gender_age     object
grade          object
exam           object
score         float64
dtype: object


0    69.0
1    63.0
2    69.0
3     NaN
4    72.0
5     NaN
6    86.0
7    81.0
8    68.0
9    59.0
Name: score, dtype: float64

In [54]:
students['score'] = pd.to_numeric(students['score'])
# mean_score could be used as the fill value instead of 0
mean_score = students['score'].mean()
students['score'] = students['score'].fillna(value=0)
students.score.head(10)

0    69.0
1    63.0
2    69.0
3     0.0
4    72.0
5     0.0
6    86.0
7    81.0
8    68.0
9    59.0
Name: score, dtype: float64

[pandas.Series.str.extract()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html)