# Clean, Sort And Wrangle The Data - Final Project
##Introduction
###Air Quality
In this project, I am going to work on the 'Air Quality' dataset. The following data file contains data collected at a roadside monitoring station that can be seen in a spreadsheet here: https://docs.google.com/spreadsheets/d/1XpAvrpuyMsKDO76EZ3kxuddBOu7cZX1Od4uEts14zco/edit?usp=sharing

The data contains:
* a heading line (Chatham Roadside) which needs to be skipped
* dates which are sometimes left- and sometimes right-justified indicating that they are not formatted as dates, rather they are text (so need to be converted to dates)
* times which are not all in the same format
* Nitrogen Dioxide levels which are, again, text and sometimes contain nodata
* Status which is always the same





### The Workflow
I am going to:

* Read the dataset into a dataframe, skipping the first row   
* Convert dates to date format  
* Remove rows with nodata in the Nitrogen dioxide column  
* Convert the Nitrogen dioxide levels values to float type  
* Sort by Nitrogen dioxide level  
* Create a new column for 'Weekdays'  
* Rename the column Nitrogen dioxide level to NO2 Level (V ug/m2)  
* Remove the Status column  

The dataset can be viewed here:  https://drive.google.com/file/d/1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ/view?usp=sharing  and the data accessed here: https://drive.google.com/uc?id=1QSNJ3B1ku8kjXsA_tCBh4fbpDK7wVLAA This is a .csv file  


In [1]:
url = 'https://drive.google.com/uc?id=1QSNJ3B1ku8kjXsA_tCBh4fbpDK7wVLAA'
import numpy as np
import pandas as pd
# The function loads the data from CSV or Excel sheets into a Pandas DataFrame.
def create_dataframe(url, db_type='csv', sheetname=None):

  if db_type == 'csv':
    df = pd.read_csv(url, skiprows=1)
  elif db_type == 'excel':
    if sheetname == None:
      df = pd.read_excel(url)
    else:
      df = pd.read_excel(url, sheet_name=sheetname)
  else:
    df = pd.read_csv(url)
  return df

air_df_20 = create_dataframe(url)
display(air_df_20.info())
display(air_df_20.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8784 entries, 0 to 8783
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Date              8784 non-null   object
 1   Time              8784 non-null   object
 2   Nitrogen dioxide  8784 non-null   object
 3   Status            8784 non-null   object
dtypes: object(4)
memory usage: 274.6+ KB


None

Unnamed: 0,Date,Time,Nitrogen dioxide,Status
0,01/01/2020,1:00,35.65193,V µg/m³
1,01/01/2020,2:00,37.99122,V µg/m³
2,01/01/2020,3:00,35.70462,V µg/m³
3,01/01/2020,4:00,36.5796,V µg/m³
4,01/01/2020,5:00,32.9441,V µg/m³


In [2]:
# The function cleans up the dataset
def clean_up(df):
  # Drop nulls
  print('Drop nulls.')
  df.dropna(inplace=True)
  display(df.info())
  display(df.head())
  print('')
  print('Convert "Date" column to Date format')
  df['Date'] = pd.to_datetime(df['Date']).dt.date
  display(df.info())
  display(df.head())
  print('')
  print('Convert "Time" column to Time and in correct format') 
  df['Time'] = df['Time'].replace('24:00:00', '00:00')
  df['Time'] = df['Time'].replace('24:00', '00:00')
  df['Time'] = pd.to_datetime(df['Time']).dt.time
  display(df.info())
  display(df.head())
  print('')
  print('Drop rows with "nodata", convert "Nitrogen dioxide" column to numeric and sort by it')    
  df.drop(df[df['Nitrogen dioxide']=='nodata'].index, inplace=True)
  df['Nitrogen dioxide'] = pd.to_numeric(df['Nitrogen dioxide'])
  df.sort_values(by=['Nitrogen dioxide'], ascending=False, inplace=True)
  display(df.info())
  display(df.head())
  print('')
  print('Add a new "Weekdays" column')   
  df['Weekdays'] = pd.to_datetime(df['Date']).dt.weekday
  display(df.info())
  display(df.head())
  print('')
  print('Renamme the "Nitrogen dioxide" column and drop the "Status" column.') 
  df.rename(columns= {'Nitrogen dioxide':'NO2 Level (V ug/m2)'}, inplace=True)
  df.drop(columns=['Status'], inplace=True)
  display(df.info())
  display(df.head())
  return df

air_df_20 = clean_up(air_df_20)

Drop nulls.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8784 entries, 0 to 8783
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Date              8784 non-null   object
 1   Time              8784 non-null   object
 2   Nitrogen dioxide  8784 non-null   object
 3   Status            8784 non-null   object
dtypes: object(4)
memory usage: 343.1+ KB


None

Unnamed: 0,Date,Time,Nitrogen dioxide,Status
0,01/01/2020,1:00,35.65193,V µg/m³
1,01/01/2020,2:00,37.99122,V µg/m³
2,01/01/2020,3:00,35.70462,V µg/m³
3,01/01/2020,4:00,36.5796,V µg/m³
4,01/01/2020,5:00,32.9441,V µg/m³



Convert "Date" column to Date format
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8784 entries, 0 to 8783
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Date              8784 non-null   object
 1   Time              8784 non-null   object
 2   Nitrogen dioxide  8784 non-null   object
 3   Status            8784 non-null   object
dtypes: object(4)
memory usage: 343.1+ KB


None

Unnamed: 0,Date,Time,Nitrogen dioxide,Status
0,2020-01-01,1:00,35.65193,V µg/m³
1,2020-01-01,2:00,37.99122,V µg/m³
2,2020-01-01,3:00,35.70462,V µg/m³
3,2020-01-01,4:00,36.5796,V µg/m³
4,2020-01-01,5:00,32.9441,V µg/m³



Convert "Time" column to Time and in correct format
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8784 entries, 0 to 8783
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Date              8784 non-null   object
 1   Time              8784 non-null   object
 2   Nitrogen dioxide  8784 non-null   object
 3   Status            8784 non-null   object
dtypes: object(4)
memory usage: 343.1+ KB


None

Unnamed: 0,Date,Time,Nitrogen dioxide,Status
0,2020-01-01,01:00:00,35.65193,V µg/m³
1,2020-01-01,02:00:00,37.99122,V µg/m³
2,2020-01-01,03:00:00,35.70462,V µg/m³
3,2020-01-01,04:00:00,36.5796,V µg/m³
4,2020-01-01,05:00:00,32.9441,V µg/m³



Drop rows with "nodata", convert "Nitrogen dioxide" column to numeric and sort by it
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8672 entries, 502 to 3442
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              8672 non-null   object 
 1   Time              8672 non-null   object 
 2   Nitrogen dioxide  8672 non-null   float64
 3   Status            8672 non-null   object 
dtypes: float64(1), object(3)
memory usage: 338.8+ KB


None

Unnamed: 0,Date,Time,Nitrogen dioxide,Status
502,2020-01-21,23:00:00,70.41527,V µg/m³
2347,2020-07-04,20:00:00,69.88823,V µg/m³
503,2020-01-21,00:00:00,69.17734,V µg/m³
504,2020-01-22,01:00:00,67.62859,V µg/m³
501,2020-01-21,22:00:00,66.59166,V µg/m³



Add a new "Weekdays" column
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8672 entries, 502 to 3442
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              8672 non-null   object 
 1   Time              8672 non-null   object 
 2   Nitrogen dioxide  8672 non-null   float64
 3   Status            8672 non-null   object 
 4   Weekdays          8672 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 406.5+ KB


None

Unnamed: 0,Date,Time,Nitrogen dioxide,Status,Weekdays
502,2020-01-21,23:00:00,70.41527,V µg/m³,1
2347,2020-07-04,20:00:00,69.88823,V µg/m³,5
503,2020-01-21,00:00:00,69.17734,V µg/m³,1
504,2020-01-22,01:00:00,67.62859,V µg/m³,2
501,2020-01-21,22:00:00,66.59166,V µg/m³,1



Renamme the "Nitrogen dioxide" column and drop the "Status" column.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8672 entries, 502 to 3442
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Date                 8672 non-null   object 
 1   Time                 8672 non-null   object 
 2   NO2 Level (V ug/m2)  8672 non-null   float64
 3   Weekdays             8672 non-null   int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 338.8+ KB


None

Unnamed: 0,Date,Time,NO2 Level (V ug/m2),Weekdays
502,2020-01-21,23:00:00,70.41527,1
2347,2020-07-04,20:00:00,69.88823,5
503,2020-01-21,00:00:00,69.17734,1
504,2020-01-22,01:00:00,67.62859,2
501,2020-01-21,22:00:00,66.59166,1


### Expand the dataset and show summary statistics for larger dataset
---

There is a second data set here covering the year 2021:  https://drive.google.com/uc?id=1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ  

I am going to:

* Concatenate the two datasets to expand it to 2020 and 2021. Before I can concatenate the datasets I will need to clean and wrangle the second dataset in the same way as the first and give the second dataset a different name. 

* After the datasets have been concatenated, group the data by Weekdays and show summary statistics by day of the week.

In [3]:
url = 'https://drive.google.com/uc?id=1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ'
air_df_21 = create_dataframe(url)
display(air_df_21.info())
display(air_df_21.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Date              8760 non-null   object
 1   Time              8760 non-null   object
 2   Nitrogen dioxide  8760 non-null   object
 3   Status            8760 non-null   object
dtypes: object(4)
memory usage: 273.9+ KB


None

Unnamed: 0,Date,Time,Nitrogen dioxide,Status
0,01/01/2021,01:00,16.58269,V µg/m³
1,01/01/2021,02:00,14.00478,V µg/m³
2,01/01/2021,03:00,15.35208,V µg/m³
3,01/01/2021,04:00,13.49688,V µg/m³
4,01/01/2021,05:00,12.47511,V µg/m³


In [4]:
# Clean the dataset
air_df_21 = clean_up(air_df_21)

Drop nulls.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8760 entries, 0 to 8759
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Date              8760 non-null   object
 1   Time              8760 non-null   object
 2   Nitrogen dioxide  8760 non-null   object
 3   Status            8760 non-null   object
dtypes: object(4)
memory usage: 342.2+ KB


None

Unnamed: 0,Date,Time,Nitrogen dioxide,Status
0,01/01/2021,01:00,16.58269,V µg/m³
1,01/01/2021,02:00,14.00478,V µg/m³
2,01/01/2021,03:00,15.35208,V µg/m³
3,01/01/2021,04:00,13.49688,V µg/m³
4,01/01/2021,05:00,12.47511,V µg/m³



Convert "Date" column to Date format
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8760 entries, 0 to 8759
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Date              8760 non-null   object
 1   Time              8760 non-null   object
 2   Nitrogen dioxide  8760 non-null   object
 3   Status            8760 non-null   object
dtypes: object(4)
memory usage: 342.2+ KB


None

Unnamed: 0,Date,Time,Nitrogen dioxide,Status
0,2021-01-01,01:00,16.58269,V µg/m³
1,2021-01-01,02:00,14.00478,V µg/m³
2,2021-01-01,03:00,15.35208,V µg/m³
3,2021-01-01,04:00,13.49688,V µg/m³
4,2021-01-01,05:00,12.47511,V µg/m³



Convert "Time" column to Time and in correct format
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8760 entries, 0 to 8759
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Date              8760 non-null   object
 1   Time              8760 non-null   object
 2   Nitrogen dioxide  8760 non-null   object
 3   Status            8760 non-null   object
dtypes: object(4)
memory usage: 342.2+ KB


None

Unnamed: 0,Date,Time,Nitrogen dioxide,Status
0,2021-01-01,01:00:00,16.58269,V µg/m³
1,2021-01-01,02:00:00,14.00478,V µg/m³
2,2021-01-01,03:00:00,15.35208,V µg/m³
3,2021-01-01,04:00:00,13.49688,V µg/m³
4,2021-01-01,05:00:00,12.47511,V µg/m³



Drop rows with "nodata", convert "Nitrogen dioxide" column to numeric and sort by it
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8680 entries, 7983 to 7177
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              8680 non-null   object 
 1   Time              8680 non-null   object 
 2   Nitrogen dioxide  8680 non-null   float64
 3   Status            8680 non-null   object 
dtypes: float64(1), object(3)
memory usage: 339.1+ KB


None

Unnamed: 0,Date,Time,Nitrogen dioxide,Status
7983,2021-11-29,16:00:00,82.59609,P µg/m³
1784,2021-03-16,09:00:00,80.27844,V µg/m³
7695,2021-11-17,16:00:00,73.4094,P µg/m³
2117,2021-03-30,06:00:00,72.66929,V µg/m³
8395,2021-12-16,20:00:00,72.00084,P µg/m³



Add a new "Weekdays" column
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8680 entries, 7983 to 7177
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              8680 non-null   object 
 1   Time              8680 non-null   object 
 2   Nitrogen dioxide  8680 non-null   float64
 3   Status            8680 non-null   object 
 4   Weekdays          8680 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 406.9+ KB


None

Unnamed: 0,Date,Time,Nitrogen dioxide,Status,Weekdays
7983,2021-11-29,16:00:00,82.59609,P µg/m³,0
1784,2021-03-16,09:00:00,80.27844,V µg/m³,1
7695,2021-11-17,16:00:00,73.4094,P µg/m³,2
2117,2021-03-30,06:00:00,72.66929,V µg/m³,1
8395,2021-12-16,20:00:00,72.00084,P µg/m³,3



Renamme the "Nitrogen dioxide" column and drop the "Status" column.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8680 entries, 7983 to 7177
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Date                 8680 non-null   object 
 1   Time                 8680 non-null   object 
 2   NO2 Level (V ug/m2)  8680 non-null   float64
 3   Weekdays             8680 non-null   int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 339.1+ KB


None

Unnamed: 0,Date,Time,NO2 Level (V ug/m2),Weekdays
7983,2021-11-29,16:00:00,82.59609,0
1784,2021-03-16,09:00:00,80.27844,1
7695,2021-11-17,16:00:00,73.4094,2
2117,2021-03-30,06:00:00,72.66929,1
8395,2021-12-16,20:00:00,72.00084,3


In [5]:
# The function concatenates the two DataFrames in to one
def concat(df1, df2):
  df = pd.concat([df1, df2], ignore_index=True)
  df.sort_values(by=['NO2 Level (V ug/m2)'], ascending=False, inplace=True)
  return df

df_all_cleaned = concat(air_df_20, air_df_21)
display(df_all_cleaned.info())
display(df_all_cleaned.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17352 entries, 8672 to 17351
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Date                 17352 non-null  object 
 1   Time                 17352 non-null  object 
 2   NO2 Level (V ug/m2)  17352 non-null  float64
 3   Weekdays             17352 non-null  int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 677.8+ KB


None

Unnamed: 0,Date,Time,NO2 Level (V ug/m2),Weekdays
8672,2021-11-29,16:00:00,82.59609,0
8673,2021-03-16,09:00:00,80.27844,1
8674,2021-11-17,16:00:00,73.4094,2
8675,2021-03-30,06:00:00,72.66929,1
8676,2021-12-16,20:00:00,72.00084,3


In [6]:
# The function shows the summary report
def show_summary(df):
  df_grouped = df.groupby('Weekdays')['NO2 Level (V ug/m2)'].describe()
  return df_grouped

display(show_summary(df_all_cleaned))

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
Weekdays,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,2443.0,14.303682,12.169515,0.3839,5.45055,10.50999,19.468465,82.59609
1,2429.0,15.297219,12.995011,-0.10519,5.7222,11.01545,21.1778,80.27844
2,2484.0,16.355849,12.679383,-0.77743,6.60506,12.45135,23.30883,73.4094
3,2503.0,14.871229,12.242185,-0.31174,6.066915,11.28099,19.83112,72.00084
4,2517.0,14.893637,11.932584,0.03299,5.76919,10.97397,20.97729,71.21977
5,2492.0,13.851765,11.724818,0.31041,4.783948,10.36683,19.487335,69.88823
6,2484.0,11.661928,9.255667,-0.4174,4.699523,9.146155,16.2188,69.11823


##Reflection
In this project I learnt how to clean and wrangle the data, a very valueable, skill a part of the Data Analysis process. I now feel comfortable and confident to proceed in the Data Analysis process. Overall, I did not find any difficulty accomplishing the tasks.