# Reflective Writing for Data Science Career Path - Codecademy
by [Charalampos Spanias](https://cspanias.github.io/aboutme/) - February 2021

## Content
1. Getting Started with Data Science
2. Python Fundamentals
3. Data Acquisition
4. Data Manipulation with Pandas
5. Data Wrangling \& Tidying
    1. Fundamentals
    1. Regular Expressions
    1. Data Cleaning
    1. [Tidy Data](#tidy)
        1. [Tidy Data in Python](#article)
        1. [Pandas & Tidy Data](#pandas)
    1. [Pair Programming](#pair)
    1. [Exploratory Data Analysis](#eda)

<a name="article"></a>
# 5.4.1 Tidy Data in Python
[Article](https://www.jeannicholashould.com/tidy-data-in-python.html)

<a name="melt"></a>
## Melt

__Pew Research Center Dataset__: This dataset explores the relationship between income and religion.

__Problem__: 
* The column headers are composed of the possible income values.

In [3]:
import pandas as pd
import datetime
from os import listdir
from os.path import isfile, join
import glob
import re

df = pd.read_csv("https://raw.githubusercontent.com/nickhould/tidy-data-python/master/data/pew-raw.csv")
df

Unnamed: 0,religion,<$10k,$10-20k,$20-30k,$30-40k,$40-50k,$50-75k
0,Agnostic,27,34,60,81,76,137
1,Atheist,12,27,37,52,35,70
2,Buddhist,27,21,30,34,33,58
3,Catholic,418,617,732,670,638,1116
4,Dont know/refused,15,14,15,11,10,35
5,Evangelical Prot,575,869,1064,982,881,1486
6,Hindu,1,9,7,9,11,34
7,Historically Black Prot,228,244,236,238,197,223
8,Jehovahs Witness,20,27,24,24,21,30
9,Jewish,19,19,25,25,30,95


A tidy version of this dataset is one in which the income values are not column headers but rather values in an `income` column.

In order to tidy this dataset, we need to `melt` it. The pandas library has a built-in function that does just that: it “unpivots” a DataFrame from a wide format to a long format.

In [4]:
formatted_df = pd.melt(df,
                       # cols to keep
                       ["religion"],
                       # new col name for var
                       var_name="income",
                       # col name for values
                       value_name="freq")
# sort alphabetically
formatted_df = formatted_df.sort_values(by=["religion"])
formatted_df.head(10)

Unnamed: 0,religion,income,freq
0,Agnostic,<$10k,27
30,Agnostic,$30-40k,81
40,Agnostic,$40-50k,76
50,Agnostic,$50-75k,137
10,Agnostic,$10-20k,34
20,Agnostic,$20-30k,60
41,Atheist,$40-50k,35
21,Atheist,$20-30k,37
11,Atheist,$10-20k,27
31,Atheist,$30-40k,52
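
In the output above the income brackets come back in an arbitrary order, because the frame is sorted by `religion` only. A minimal sketch of one way to keep the brackets in their natural order, assuming an ordered categorical is acceptable here; the names `pew`, `income_order` and `tidy` are illustrative, and only the first two rows and three brackets of the table are reproduced:

```python
import pandas as pd

# First two rows and three income columns of the Pew table shown above
pew = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
    "$20-30k": [60, 37],
})
income_order = ["<$10k", "$10-20k", "$20-30k"]

tidy = pew.melt(id_vars="religion", var_name="income", value_name="freq")

# An ordered categorical makes sort_values follow bracket order, not string order
tidy["income"] = pd.Categorical(tidy["income"], categories=income_order, ordered=True)
tidy = tidy.sort_values(["religion", "income"]).reset_index(drop=True)
print(tidy)
```

Sorting on both keys gives every religion its brackets from lowest to highest, which tends to be easier to read than the default ordering.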


<a name="meltB"></a>
## Melt B

__Billboard Top 100 Dataset__: This dataset represents the weekly rank of songs from the moment they enter the Billboard Top 100 through the subsequent 75 weeks.

__Problems__: 
* The column headers are composed of values: the week number (`x1st.week`, …)
* If a song is in the Top 100 for fewer than 75 weeks, the remaining columns are filled with missing values.

In [9]:
df = pd.read_csv("https://raw.githubusercontent.com/nickhould/tidy-data-python/master/data/billboard.csv", encoding="mac_latin2")
df.head(10)

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,...,,,,,,,,,,
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,...,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,...,,,,,,,,,,
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,...,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,...,,,,,,,,,,
5,2000,Janet,Doesn't Really Matter,4:17,Rock,2000-06-17,2000-08-26,59,52.0,43.0,...,,,,,,,,,,
6,2000,Destiny's Child,Say My Name,4:31,Rock,1999-12-25,2000-03-18,83,83.0,44.0,...,,,,,,,,,,
7,2000,"Iglesias, Enrique",Be With You,3:36,Latin,2000-04-01,2000-06-24,63,45.0,34.0,...,,,,,,,,,,
8,2000,Sisqo,Incomplete,3:52,Rock,2000-06-24,2000-08-12,77,66.0,61.0,...,,,,,,,,,,
9,2000,Lonestar,Amazed,4:25,Country,1999-06-05,2000-03-04,81,54.0,44.0,...,,,,,,,,,,


A tidy version of this dataset is one __without the week numbers as columns__ but rather as values of a single column.

In order to do so, we’ll `melt` the week columns into a single `week` column, from which a date is derived later.

We will create __one row per week for each record__. If there is no data for the given week, we will not create a row.

In [10]:
# melting
id_vars = ["year",
           "artist.inverted",
           "track",
           "time",
           "genre",
           "date.entered",
           "date.peaked"]

df = pd.melt(frame=df,id_vars=id_vars, var_name="week", value_name="rank")

df.head(10)

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,x1st.week,78.0
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,x1st.week,15.0
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,x1st.week,71.0
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,x1st.week,41.0
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,x1st.week,57.0
5,2000,Janet,Doesn't Really Matter,4:17,Rock,2000-06-17,2000-08-26,x1st.week,59.0
6,2000,Destiny's Child,Say My Name,4:31,Rock,1999-12-25,2000-03-18,x1st.week,83.0
7,2000,"Iglesias, Enrique",Be With You,3:36,Latin,2000-04-01,2000-06-24,x1st.week,63.0
8,2000,Sisqo,Incomplete,3:52,Rock,2000-06-24,2000-08-12,x1st.week,77.0
9,2000,Lonestar,Amazed,4:25,Country,1999-06-05,2000-03-04,x1st.week,81.0


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24092 entries, 0 to 24091
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year             24092 non-null  int64  
 1   artist.inverted  24092 non-null  object 
 2   track            24092 non-null  object 
 3   time             24092 non-null  object 
 4   genre            24092 non-null  object 
 5   date.entered     24092 non-null  object 
 6   date.peaked      24092 non-null  object 
 7   week             24092 non-null  int32  
 8   rank             5307 non-null   float64
dtypes: float64(1), int32(1), int64(1), object(6)
memory usage: 1.6+ MB


In [18]:
# Cleaning out unnecessary rows
df = df.dropna()

# Formatting: keep only the digits of e.g. "x1st.week" (raw string avoids an invalid-escape warning)
df["week"] = df["week"].str.extract(r"(\d+)", expand=False).astype(int)
df["rank"] = df["rank"].astype(int)

df.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,1,78
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,1,15
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,1,71
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,1,41
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,1,57
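
The melt, drop and format steps above can be tried on a small frame. This is only a sketch: the two tracks `A` and `B` and their ranks are made up, and `wide` and `long_df` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical two-song excerpt; song B left the chart after week 2
wide = pd.DataFrame({
    "track": ["A", "B"],
    "x1st.week": [78, 15],
    "x2nd.week": [63.0, 8.0],
    "x3rd.week": [49.0, np.nan],
})

# One row per (track, week); weeks without a rank are dropped
long_df = wide.melt(id_vars="track", var_name="week", value_name="rank")
long_df = long_df.dropna().reset_index(drop=True)

# "x2nd.week" -> 2, and ranks back to integers now that no NaN remains
long_df["week"] = long_df["week"].str.extract(r"(\d+)", expand=False).astype(int)
long_df["rank"] = long_df["rank"].astype(int)
print(long_df)
```

Six cells go in and five rows come out, since the missing third week of song B does not produce a row.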


In [19]:
# Create "date" column: entry date plus (week - 1) weeks
df['date'] = pd.to_datetime(df['date.entered']) + pd.to_timedelta(df['week'], unit='w') - pd.DateOffset(weeks=1)

df = df[["year", 
         "artist.inverted",
         "track",
         "time",
         "genre",
         "week",
         "rank",
         "date"]]
df = df.sort_values(ascending=True, by=["year","artist.inverted","track","week","rank"])

# Assigning the tidy dataset to a variable for future usage
billboard = df

billboard.head()

Unnamed: 0,year,artist.inverted,track,time,genre,week,rank,date
246,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,1,87,2000-02-26
563,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2,82,2000-03-04
880,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,3,72,2000-03-11
1197,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,4,77,2000-03-18
1514,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,5,87,2000-03-25
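
The date arithmetic in the cell above can be checked by hand on a tiny frame. The sketch below assumes a single song that entered the chart on 2000-02-26, as in the first rows of the output; `toy` is an illustrative name.

```python
import pandas as pd

# One song, three chart weeks, entry date taken from the output above
toy = pd.DataFrame({
    "date.entered": ["2000-02-26"] * 3,
    "week": [1, 2, 3],
})

# Week 1 is the entry week itself, so the offset is (week - 1) weeks
toy["date"] = pd.to_datetime(toy["date.entered"]) + pd.to_timedelta(toy["week"] - 1, unit="W")
print(toy)
```

Subtracting one from `week` before converting is equivalent to adding the full week count and then removing one week with `pd.DateOffset(weeks=1)`.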


Following up on the __Billboard dataset__, we’ll now address the repetition problem of the previous table.

__Problems__:
* Multiple observational units (the `song` and its `rank`) in a single table.

We’ll first create a `songs` table which contains the details of each song:

In [20]:
songs_cols = ["year", "artist.inverted", "track", "time", "genre"]
songs = billboard[songs_cols].drop_duplicates()
songs = songs.reset_index(drop=True)
songs["song_id"] = songs.index
songs.head(10)

Unnamed: 0,year,artist.inverted,track,time,genre,song_id
0,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,0
1,2000,2Ge+her,The Hardest Part Of Breaking Up (Is Getting Ba...,3:15,R&B,1
2,2000,3 Doors Down,Kryptonite,3:53,Rock,2
3,2000,3 Doors Down,Loser,4:24,Rock,3
4,2000,504 Boyz,Wobble Wobble,3:35,Rap,4
5,2000,98°,Give Me Just One Night (Una Noche),3:24,Rock,5
6,2000,A*Teens,Dancing Queen,3:44,Pop,6
7,2000,Aaliyah,I Don't Wanna,4:15,Rock,7
8,2000,Aaliyah,Try Again,4:03,Rock,8
9,2000,"Adams, Yolanda",Open My Heart,5:30,Gospel,9


We’ll then create a `ranks` table which only contains the `song_id`, `date` and the `rank`.

In [21]:
ranks = pd.merge(billboard, songs, on=["year","artist.inverted", "track", "time", "genre"])
ranks = ranks[["song_id", "date","rank"]]
ranks.head(10)

Unnamed: 0,song_id,date,rank
0,0,2000-02-26,87
1,0,2000-03-04,82
2,0,2000-03-11,72
3,0,2000-03-18,77
4,0,2000-03-25,87
5,0,2000-04-01,94
6,0,2000-04-08,99
7,1,2000-09-02,91
8,1,2000-09-09,87
9,1,2000-09-16,92


__Tuberculosis Records from World Health Organization__

This dataset documents the count of confirmed tuberculosis cases by country, year, age and sex.

__Problems__:
* Some columns contain multiple values: sex and age.
* A mixture of zeros and missing values (`NaN`). This is due to the data collection process, and the distinction is important for this dataset.
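
To illustrate why the distinction matters, here is a small sketch with made-up counts: `dropna` discards the rows where nothing was reported but keeps the rows where zero cases were reported. The frame `cases` and the result `kept` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical counts: 0.0 is a reported zero, NaN means nothing was reported
cases = pd.DataFrame({
    "country": ["AD", "AD", "AE"],
    "sex_and_age": ["m014", "mu", "m014"],
    "cases": [0.0, np.nan, 2.0],
})

# Only the NaN row is removed; the zero count survives
kept = cases.dropna(subset=["cases"])
print(kept)
```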

In [42]:
df = pd.read_csv("https://raw.githubusercontent.com/nickhould/tidy-data-python/master/data/tb-raw.csv")
df.head()

Unnamed: 0,country,year,m014,m1524,m2534,m3544,m4554,m5564,m65,mu,f014
0,AD,2000,0.0,0.0,1.0,0.0,0,0,0.0,,
1,AE,2000,2.0,4.0,4.0,6.0,5,12,10.0,,3.0
2,AF,2000,52.0,228.0,183.0,149.0,129,94,80.0,,93.0
3,AG,2000,0.0,0.0,0.0,0.0,0,0,1.0,,1.0
4,AL,2000,2.0,19.0,21.0,14.0,24,19,16.0,,3.0


In order to tidy this dataset, we need to __remove the different values from the header and unpivot them into rows__. 

We’ll first need to `melt` the `sex` + `age` group columns into a single one. 

Once we have that single column, we’ll derive three columns from it: `sex`, `age_lower` and `age_upper`. 

With those, we’ll be able to properly build a tidy dataset.

In [43]:
df = pd.melt(df, id_vars=["country","year"], value_name="cases", var_name="sex_and_age")
df.head()

Unnamed: 0,country,year,sex_and_age,cases
0,AD,2000,m014,0.0
1,AE,2000,m014,2.0
2,AF,2000,m014,52.0
3,AG,2000,m014,0.0
4,AL,2000,m014,2.0


In [44]:
# Extract sex, age lower bound and age upper bound (raw string avoids invalid-escape warnings)
tmp_df = df["sex_and_age"].str.extract(r"(\D)(\d+)(\d{2})")
tmp_df.head()

Unnamed: 0,0,1,2
0,m,0,14
1,m,0,14
2,m,0,14
3,m,0,14
4,m,0,14
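
It can help to see how this pattern behaves on each kind of column name. The sketch below runs it on four codes taken from the header of the raw table; `codes` and `parts` are illustrative names.

```python
import pandas as pd

codes = pd.Series(["m014", "m1524", "m65", "mu"])

# (\D) one non-digit, (\d+) lower bound, (\d{2}) two-digit upper bound
parts = codes.str.extract(r"(\D)(\d+)(\d{2})")
print(parts)
```

`m014` and `m1524` split into `m / 0 / 14` and `m / 15 / 24`. `m65` and `mu` do not match, because the pattern needs at least three digits, so all three groups come back as `NaN`. Rows for those two groups are therefore removed by the later `dropna` call.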


In [45]:
# Name columns
tmp_df.columns = ["sex", "age_lower", "age_upper"]
tmp_df.head()

Unnamed: 0,sex,age_lower,age_upper
0,m,0,14
1,m,0,14
2,m,0,14
3,m,0,14
4,m,0,14


In [46]:
# Create `age` column based on `age_lower` and `age_upper`
tmp_df["age"] = tmp_df["age_lower"] + "-" + tmp_df["age_upper"]
tmp_df.head()

Unnamed: 0,sex,age_lower,age_upper,age
0,m,0,14,0-14
1,m,0,14,0-14
2,m,0,14,0-14
3,m,0,14,0-14
4,m,0,14,0-14


In [47]:
# Merge 
df = pd.concat([df, tmp_df], axis=1)
df.head()

Unnamed: 0,country,year,sex_and_age,cases,sex,age_lower,age_upper,age
0,AD,2000,m014,0.0,m,0,14,0-14
1,AE,2000,m014,2.0,m,0,14,0-14
2,AF,2000,m014,52.0,m,0,14,0-14
3,AG,2000,m014,0.0,m,0,14,0-14
4,AL,2000,m014,2.0,m,0,14,0-14


In [66]:
# Drop unnecessary columns and rows
df.drop(['sex_and_age',"age_lower","age_upper"], axis=1, inplace=True)
df.dropna(inplace=True)
df.sort_values(ascending=True,by=["country", "year", "sex", "age"], inplace=True)
df = df.reset_index(drop=True)  # drop=True discards the old index, so no "index" column is left to remove
df.head(10)

Unnamed: 0,country,year,cases,sex,age
0,AD,2000,0.0,m,0-14
1,AD,2000,0.0,m,15-24
2,AD,2000,1.0,m,25-34
3,AD,2000,0.0,m,35-44
4,AD,2000,0.0,m,45-54
5,AD,2000,0.0,m,55-64
6,AE,2000,3.0,f,0-14
7,AE,2000,2.0,m,0-14
8,AE,2000,4.0,m,15-24
9,AE,2000,4.0,m,25-34


__Global Historical Climatology Network Dataset__

This dataset represents the daily weather records for a weather station (MX17004) in Mexico for five months in 2010.

__Problems__:

* Variables are stored in both rows (tmin, tmax) and columns (days).

In [72]:
df = pd.read_csv("https://raw.githubusercontent.com/nickhould/tidy-data-python/master/data/weather-raw.csv")
df.head()

Unnamed: 0,id,year,month,element,d1,d2,d3,d4,d5,d6,d7,d8
0,MX17004,2010,1,tmax,,,,,,,,
1,MX17004,2010,1,tmin,,,,,,,,
2,MX17004,2010,2,tmax,,27.3,24.1,,,,,
3,MX17004,2010,2,tmin,,14.4,14.4,,,,,
4,MX17004,2010,3,tmax,,,,,32.1,,,


In order to make this dataset tidy, we want to move the three misplaced variables (`tmin`, `tmax` and `days`) into three individual columns: `tmin`, `tmax`, and `date`.

In [71]:
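# Melting the day columns (d1, d2, ...) into rows; the readings land in the default "value" column
df = pd.melt(df, id_vars=["id", "year", "month", "element"], var_name="day_raw")

# Extracting day
df["day"] = df["day_raw"].str.extract(r"d(\d+)", expand=False)
df["id"] = "MX17004"
df.head()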

In [None]:
# To numeric values (errors='ignore' is deprecated in recent pandas releases)
df[["year","month","day"]] = df[["year","month","day"]].apply(pd.to_numeric)

# Creating a date from the different columns
def create_date_from_year_month_day(row):
    return datetime.datetime(year=row["year"], month=int(row["month"]), day=row["day"])

df["date"] = df.apply(lambda row: create_date_from_year_month_day(row), axis=1)
df = df.drop(['year',"month","day", "day_raw"], axis=1)
df = df.dropna()

# Unmelting column "element"
df = df.pivot_table(index=["id","date"], columns="element", values="value")
df.reset_index(drop=False, inplace=True)
df
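
The whole weather reshaping can be sketched end to end on a small frame. The excerpt below is hypothetical (one month, two day columns) but uses the same layout as the raw file. It also assumes that `pd.to_datetime` is used to assemble the date from the `year`, `month` and `day` columns, which is a vectorised alternative to the row-wise helper above rather than the original method.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt in the same wide layout as the raw file
raw = pd.DataFrame({
    "id": ["MX17004", "MX17004"],
    "year": [2010, 2010],
    "month": [2, 2],
    "element": ["tmax", "tmin"],
    "d1": [np.nan, np.nan],
    "d2": [27.3, 14.4],
})

# Unpivot the day columns, then drop the days with no reading
long_df = raw.melt(id_vars=["id", "year", "month", "element"], var_name="day_raw")
long_df = long_df.dropna().reset_index(drop=True)
long_df["day"] = long_df["day_raw"].str.extract(r"d(\d+)", expand=False).astype(int)

# Assemble a date from the year/month/day columns
long_df["date"] = pd.to_datetime(long_df[["year", "month", "day"]])

# Spread tmax/tmin back out into their own columns
tidy = long_df.pivot_table(index=["id", "date"], columns="element", values="value").reset_index()
tidy.columns.name = None
print(tidy)
```

The result has one row per station and date, with `tmax` and `tmin` as separate columns.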

__Illinois Male Baby Names for the years 2014 and 2015__

__Problems__:

* The data is spread across multiple tables/files.
* The “Year” variable is present in the file name.

In order to load those different files into a single DataFrame, we can run a custom script that will append the files together. Furthermore, we’ll need to extract the “Year” variable from the file name.

In [76]:
def extract_year(string):
    # search (not match with a leading ".+") so a year at the very start of the name is found
    match = re.search(r"(\d{4})", string)
    if match is not None:
        return match.group(1)

In [78]:
import glob

allFiles = glob.glob("201*-baby-names-illinois.csv")
allFiles

['2014-baby-names-illinois.csv', '2015-baby-names-illinois.csv']

In [79]:
df_list = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    df.columns = df.columns.str.lower()
    df["year"] = extract_year(file_)
    df_list.append(df)

df = pd.concat(df_list)
df.head(5)

Unnamed: 0,rank,name,frequency,sex,year
0,1,Noah,837,Male,2014
1,2,Alexander,747,Male,2014
2,3,William,687,Male,2014
3,4,Michael,680,Male,2014
4,5,Liam,670,Male,2014
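
The loading pattern can be tried without the CSV files on disk. In the sketch below the file contents are in-memory stand-ins: the 2014 row is copied from the output above, while the 2015 row is a made-up placeholder. The dictionary `files` and the variable `names` are illustrative.

```python
import io
import re

import pandas as pd


def extract_year(filename):
    """Return the first four-digit run in the file name as an int, or None."""
    match = re.search(r"\d{4}", filename)
    return int(match.group()) if match else None


# Hypothetical in-memory stand-ins for the two CSV files
files = {
    "2014-baby-names-illinois.csv": "Rank,Name,Frequency,Sex\n1,Noah,837,Male\n",
    "2015-baby-names-illinois.csv": "Rank,Name,Frequency,Sex\n1,Placeholder,100,Male\n",
}

frames = []
for filename, text in files.items():
    part = pd.read_csv(io.StringIO(text))
    part.columns = part.columns.str.lower()   # normalise header case
    part["year"] = extract_year(filename)     # variable stored in the file name
    frames.append(part)

names = pd.concat(frames, ignore_index=True)
print(names)
```

Using `ignore_index=True` gives the combined frame a fresh index, so row labels are not repeated across files.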


<a name="pandas"></a>
# 5.4.2 Pandas & Tidy Data

<a name="tidy"></a>
A tidy dataset follows __three fundamental rules__:

1. Each variable forms a column.
1. Each observation forms a row.
1. Each type of observational unit forms a table.

An __observational unit__ is the individual object or instance that we capture information about.

A dataset that violates at least one of these rules is referred to as “__messy data__”.

<a name="wide"></a>
## Cleaning Wide-Form Data

A dataset is considered to be in wide-form when __at least one variable is represented across multiple columns as column headers rather than in a single column__.

In [2]:
import pandas as pd
 
# create wide-form df
data = pd.DataFrame({"Name":["Annie", "John", "Min-ji", "Ravi", "Lucas"],
    "Test1" : [85,92,88,86,91],
    "Test2" : [78,86,79,90,93],
    "Test3" : [98,90,95,78,88]})
print(data)

     Name  Test1  Test2  Test3
0   Annie     85     78     98
1    John     92     86     90
2  Min-ji     88     79     95
3    Ravi     86     90     78
4   Lucas     91     93     88


For data analysis, programs can summarize and analyze data better when each row of the dataset represents a single observation. 

For example, `groupby` functions separate rows of data by the value within one or more columns rather than by the column header.

In [3]:
# tidy data
data_tidy = pd.melt(data, id_vars="Name", var_name="Test", value_name="Score")
print(data_tidy)

      Name   Test  Score
0    Annie  Test1     85
1     John  Test1     92
2   Min-ji  Test1     88
3     Ravi  Test1     86
4    Lucas  Test1     91
5    Annie  Test2     78
6     John  Test2     86
7   Min-ji  Test2     79
8     Ravi  Test2     90
9    Lucas  Test2     93
10   Annie  Test3     98
11    John  Test3     90
12  Min-ji  Test3     95
13    Ravi  Test3     78
14   Lucas  Test3     88


In [4]:
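# mean of each student (numeric_only skips the text column "Test", which recent pandas would otherwise reject)
data_tidy.groupby(by="Name").mean(numeric_only=True)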

Unnamed: 0_level_0,Score
Name,Unnamed: 1_level_1
Annie,87.0
John,89.333333
Lucas,90.666667
Min-ji,87.333333
Ravi,84.666667


In [5]:
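# mean of each test (numeric_only skips the text column "Name", which recent pandas would otherwise reject)
data_tidy.groupby(by="Test").mean(numeric_only=True)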

Unnamed: 0_level_0,Score
Test,Unnamed: 1_level_1
Test1,88.4
Test2,85.2
Test3,89.8
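
Once the data is tidy, several summaries can be requested in one call. The sketch below rebuilds the same `data_tidy` frame and selects the `Score` column before aggregating; `summary` is an illustrative name.

```python
import pandas as pd

data = pd.DataFrame({"Name": ["Annie", "John", "Min-ji", "Ravi", "Lucas"],
                     "Test1": [85, 92, 88, 86, 91],
                     "Test2": [78, 86, 79, 90, 93],
                     "Test3": [98, 90, 95, 78, 88]})
data_tidy = pd.melt(data, id_vars="Name", var_name="Test", value_name="Score")

# Mean and best score per test, computed on the Score column only
summary = data_tidy.groupby("Test")["Score"].agg(["mean", "max"])
print(summary)
```

Selecting `Score` first keeps the text columns out of the aggregation entirely.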


<a name="long"></a>
## Cleaning Long-Form Data

A dataset is “__too long__” when __a single column in the dataset represents more than one variable__, which creates extra rows without adding any information compared with the same dataset in tidy form.

In [6]:
## Create Long Dataframe
data = pd.DataFrame({"participant": [1,2,3,1,2,3],
                      "attribute": ["age", "age", "age", "income", "income", "income"],
                      "value": [24, 57, 23, 30, 60, 28]})
## Print Dataframe
print(data)

   participant attribute  value
0            1       age     24
1            2       age     57
2            3       age     23
3            1    income     30
4            2    income     60
5            3    income     28


The `pivot` method in the pandas package allows us to __reshape the dataset based on the values of a column__. The parameters of this method are:

* `index`: the name of the column to make the new data frame’s index (in this scenario, “participant”).
* `columns`: the name of the column to make the new data frame’s column headers (in this scenario, “attribute”).
* `values`: the name of the column that will populate the new data frame’s values (in this scenario, “value”).

In [7]:
data_tidy = data.pivot(index="participant",
                         columns="attribute",
                         values="value").reset_index()
data_tidy.columns.name = None

Calling `.reset_index()` and setting `.columns.name = None` clears up the index and column labels that were carried over from the original form of the data set.

In [8]:
data_tidy

Unnamed: 0,participant,age,income
0,1,24,30
1,2,57,60
2,3,23,28
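
One caveat worth knowing: `pivot` requires every `index`/`columns` pair to be unique. The sketch below uses a made-up frame in which participant 1 has two `age` rows; `dup`, `pivot_failed` and `averaged` are illustrative names.

```python
import pandas as pd

# Hypothetical long data where participant 1 has two "age" rows
dup = pd.DataFrame({"participant": [1, 1, 2],
                    "attribute": ["age", "age", "age"],
                    "value": [24, 26, 57]})

# pivot cannot decide which duplicate to keep and raises ValueError
try:
    dup.pivot(index="participant", columns="attribute", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates duplicates (mean by default) instead of raising
averaged = dup.pivot_table(index="participant", columns="attribute", values="value")
print(pivot_failed)
print(averaged)
```

Whether averaging duplicates is appropriate depends on the data. In many cases it is safer to find out why the duplicates exist before reshaping.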
