# Assignment #2 - Data Gathering and Warehousing - DSSA-5102

Instructor: Melissa Laurino</br>
Spring 2024</br></br>
Name: Louise Ramos</br>
Date: 02/01/2024

Our next objective is to choose <b>ONE</b> of the datasets from our previous assignment to explore further. The datasets we chose for Assignment #1 are manageable to clean in R (or Python, if that is what you prefer to explore; see the technology check for working with Python in R in Jupyter notebook). Depending on your data, and especially its size, it may be more beneficial to clean it in a language we are already comfortable working in instead of cleaning it in SQL. SQL may be needed to clean databases that are very large, such as those hundreds of terabytes in size. We will clean our datasets first, before we attempt to load them into our SQL databases. </br>
Not only is data everywhere, but it can also be messy. Messy data can originate in the data collection process, whether through manual data entry and typos, or through outdated collection forms that hold multiple variables that mean the same thing. For example, while collecting data on marine mammals, it is important to note who the observer is. When Python or R reads an Excel or CSV file, text values are compared in a case-sensitive way, so "Melissa Laurino" and "melissa laurino" are treated as two separate observers. This is not accurate, because both entries are meant to be the same person within the observer column or category.</br>
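
The observer example above can be sketched in pandas. This is a minimal illustration on made-up data: the `sightings` frame and its `observer` column are assumptions for the example, not part of the assignment dataset.

```python
import pandas as pd

# Hypothetical observer column with inconsistent capitalization and spacing.
sightings = pd.DataFrame(
    {"observer": ["Melissa Laurino", "melissa laurino", " Melissa Laurino "]}
)

# Before cleaning, each spelling counts as a different observer.
observers_before = sightings["observer"].nunique()

# Strip surrounding whitespace and standardize capitalization.
sightings["observer"] = sightings["observer"].str.strip().str.title()
observers_after = sightings["observer"].nunique()

print(observers_before, observers_after)
```

Title case suits simple names like this one; names with unusual capitalization may need a different rule, such as lowercasing everything.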
Clean data is important for consistency that leads to accurate results and analysis. If we are using our data to make informed decisions in our field, we need it to be clean. We do not want to omit rows that may make a difference to our dataset because they fail certain criteria due to typos, but how much should the original dataset be altered? Depending on your field, there may be regulations and compliance standards regarding data quality. Protocols may state that data which does not read exactly as it should must be omitted. </br>
For our learning objectives in this class, we will clean our data. Our first assignment in our warehousing journey was important because it allowed us to gain a better understanding of a dataset that we personally did not collect. Now that we have that understanding, we can explore it in greater depth and clean it as necessary.<br>
<br>
It is important when cleaning data to: <br>
*Make detailed comments with your code* <br>
*Record EVERYTHING omitted and changed if necessary* <br>
*Since we are exploring and learning without a specific organization policy, use your best judgement when omitting records. If you have chosen to omit data, please explain why.*</br>
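
One possible way to record what was omitted is to keep a small log alongside the cleaning code. The sketch below is only an illustration: the `records` data, the `cleaning_log` list, and its field names are all hypothetical.

```python
import pandas as pd

# Hypothetical data: one row has a missing count.
records = pd.DataFrame({"site": ["A", "B", "C"], "count": [4, None, 7]})

cleaning_log = []  # one dictionary per cleaning step

# Drop rows with a missing count and record how many were removed and why.
rows_before = len(records)
records = records.dropna(subset=["count"])
cleaning_log.append(
    {
        "step": "drop rows with missing count",
        "rows_removed": rows_before - len(records),
        "reason": "count is required for the analysis",
    }
)

# The log can be printed, or saved next to the cleaned file.
log_frame = pd.DataFrame(cleaning_log)
print(log_frame)
```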
<br>
<b>The code that I have written below is just to give you ideas on exploring and cleaning data. It is encouraged that you explore and clean it in greater detail than what I have written below for full credit.</b><br>
Additional examples: https://epirhandbook.com/en/cleaning-data-and-core-functions.html

<b>Dataset name:</b> Provisional COVID 19 Deaths<br>
<b>Company/Government Organization:</b> CDC<br>
Download link: https://data.cdc.gov/NCHS/Provisional-COVID-19-death-counts-and-rates-by-mon/yrur-wghw/about_data

Load necessary libraries:

In [95]:
import pandas as pd
from datetime import datetime

Load data into Python:

In [96]:
covid_frame = pd.read_csv("Provisional_Covid_Deaths.csv")

Exploration before cleaning:

In [97]:
# Checking the number of elements in the DataFrame
frame_elements = covid_frame.size

# Printing the number of elements (rows x columns).
print(f"There are {frame_elements} data items in the Covid Deaths DataFrame.\n")

There are 555984 data items in the Covid Deaths DataFrame.



In [98]:
# Counting missing values (read in as NaN) in each column of the DataFrame
missing_count = covid_frame.isna().sum()

# Get frame size
frame_size = len(covid_frame)
# Put missing data in terms of percent and round to one decimal place.
missing_percent = round((missing_count/frame_size) *100, 1)

# Printing the count
print(f"{missing_count} \n")
print(missing_percent)

data_as_of                    0
jurisdiction_residence        0
year                          0
month                         0
group                         0
subgroup1                     0
subgroup2                  9504
COVID_deaths              10136
crude_COVID_rate          13680
aa_COVID_rate             39175
crude_COVID_rate_ann      13680
aa_COVID_rate_ann         39175
footnote                  29088
dtype: int64 

data_as_of                 0.0
jurisdiction_residence     0.0
year                       0.0
month                      0.0
group                      0.0
subgroup1                  0.0
subgroup2                 22.2
COVID_deaths              23.7
crude_COVID_rate          32.0
aa_COVID_rate             91.6
crude_COVID_rate_ann      32.0
aa_COVID_rate_ann         91.6
footnote                  68.0
dtype: float64
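
As a side note, the percentage of missing values can also be computed in one step, because the mean of a boolean column is the fraction of `True` values. A minimal sketch on a made-up frame (`toy` and its columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame: column "a" is half missing, column "b" is complete.
toy = pd.DataFrame({"a": [1, None, 3, None], "b": ["x", "y", "z", "w"]})

# isna() gives True/False per cell; the column mean is the missing fraction.
missing_percent = (toy.isna().mean() * 100).round(1)
print(missing_percent)
```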


What columns are missing values (If any)? Do you think you should remove the rows of data at this time in the exploration? Why or why not?

The columns with the most missing values are the rate columns, which are blank when the number of reported COVID deaths for a subgroup or pair of subgroups is very small. In addition, rows for groups with only a single subgroup have no subgroup2 value, so they are counted as missing. I will not remove every row that contains an NA in any column, but I do plan to check that the missing values in subgroup2 are meant to be missing and are not errors.
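
The check described above can be sketched on a small made-up frame. The column names follow the dataset, but the rows are invented, and the last row is a deliberate error so that the check has something to find.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "group": ["Sex", "Age", "Race and Age", "Race and Age"],
        "subgroup1": ["Female", "0-4 years", "Non-Hispanic White", "Non-Hispanic White"],
        "subgroup2": [None, None, "0-4 years", None],  # last row is a planted error
    }
)

# A paired group is one whose name contains the whole word "and", in any case.
is_paired_group = toy["group"].str.contains(r"\band\b", case=False, regex=True)
missing_sub2 = toy["subgroup2"].isna()

# Count missing subgroup2 values per group to see where they occur.
missing_by_group = missing_sub2.groupby(toy["group"]).sum()
print(missing_by_group)

# Rows that should have a second subgroup but do not.
suspect_rows = toy[is_paired_group & missing_sub2]
print(suspect_rows)
```

Matching on the whole word, case-insensitively, avoids relying on the exact capitalization of the group labels.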

If you chose to remove rows with specific missing values:

In [99]:
# Check that rows with "and" in the group column (a pair of subgroups) have a value in subgroup2. Any rows printed here are errors to remove.
rows_to_remove = covid_frame.loc[(covid_frame["group"].str.contains("and")) & covid_frame["subgroup2"].isna()]
print(rows_to_remove)

# Removing the data_as_of and footnote columns since they hold only contextual information
covid_frame.drop(["data_as_of", "footnote"], axis=1, inplace=True)

Empty DataFrame
Columns: [data_as_of, jurisdiction_residence, year, month, group, subgroup1, subgroup2, COVID_deaths, crude_COVID_rate, aa_COVID_rate, crude_COVID_rate_ann, aa_COVID_rate_ann, footnote]
Index: []


What about duplicates?

In [100]:
#Check for any rows that are completely duplicated
duplicate_rows = covid_frame[covid_frame.duplicated()]

#Print duplicates:
print(duplicate_rows)

# There are no rows that are completely duplicated. Checking if any rows in the same location also have the same subgroups and timeframe
duplicate_subset = covid_frame[covid_frame.duplicated(subset=["jurisdiction_residence", "year", "month", "group", "subgroup1", "subgroup2"])]

#Print duplicates
print(duplicate_subset)

Empty DataFrame
Columns: [jurisdiction_residence, year, month, group, subgroup1, subgroup2, COVID_deaths, crude_COVID_rate, aa_COVID_rate, crude_COVID_rate_ann, aa_COVID_rate_ann]
Index: []
Empty DataFrame
Columns: [jurisdiction_residence, year, month, group, subgroup1, subgroup2, COVID_deaths, crude_COVID_rate, aa_COVID_rate, crude_COVID_rate_ann, aa_COVID_rate_ann]
Index: []


In [101]:
# No duplicate rows to remove: neither the full-row check nor the check on the identifying columns (location, timeframe, group, and subgroups) found any repeats.
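
If a duplicate check does find rows, it can help to see every copy side by side before deciding which to keep. A minimal sketch on invented data (the rows and the shortened key list are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "jurisdiction_residence": ["Region 1", "Region 1", "Region 2"],
        "year": [2020, 2020, 2020],
        "month": [3, 3, 3],
        "COVID_deaths": [10.0, 12.0, 5.0],  # conflicting counts for the same key
    }
)

key_columns = ["jurisdiction_residence", "year", "month"]

# keep=False flags every row that shares a key, not only the later repeats.
conflicts = toy[toy.duplicated(subset=key_columns, keep=False)]

# The default (keep="first") flags only the second and later occurrences.
repeats_only = toy[toy.duplicated(subset=key_columns)]

print(conflicts)
print(repeats_only)
```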

Let's revisit the structure and look at the data types for each column. This will be important for SQL.

In [102]:
# Checking type of data in column
frame_types = covid_frame.dtypes

# Print column data types
print(frame_types)

jurisdiction_residence     object
year                        int64
month                       int64
group                      object
subgroup1                  object
subgroup2                  object
COVID_deaths              float64
crude_COVID_rate          float64
aa_COVID_rate             float64
crude_COVID_rate_ann      float64
aa_COVID_rate_ann         float64
dtype: object
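
To see why the data types matter for SQL, the sketch below maps pandas dtypes to SQL column types. The mapping in `sql_type_for` is a hypothetical helper with illustrative type names; the correct names depend on the SQL dialect used later in the course.

```python
import pandas as pd

# Hypothetical one-row frame using column names from the dataset.
toy = pd.DataFrame(
    {
        "jurisdiction_residence": ["United States"],
        "year": [2020],
        "COVID_deaths": [3.0],
    }
)


def sql_type_for(dtype) -> str:
    """Return an illustrative SQL type name for a pandas dtype (assumed mapping)."""
    if pd.api.types.is_integer_dtype(dtype):
        return "INTEGER"
    if pd.api.types.is_float_dtype(dtype):
        return "DOUBLE PRECISION"
    if pd.api.types.is_datetime64_any_dtype(dtype):
        return "DATE"
    return "TEXT"  # fall back to text for object/string columns


column_types = {name: sql_type_for(dtype) for name, dtype in toy.dtypes.items()}
print(column_types)
```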


In [103]:
# Combine year and month into a single column and convert it to a date (the first day of each month)
covid_frame["Observation Date"] = covid_frame["month"].astype(str) + "-" + covid_frame["year"].astype(str)
covid_frame["Date"] = pd.to_datetime(covid_frame["Observation Date"], format="%m-%Y")

# Drop excess columns
covid_frame.drop(["Observation Date", "year", "month"], axis=1, inplace=True)

# Cast the four descriptive columns to strings and the COVID statistic columns to float64.
# Note: astype(str) turns missing subgroup2 values into the text "nan"; this is handled in a later step.
covid_frame = covid_frame.astype({"jurisdiction_residence": str, "group": str, "subgroup1": str, "subgroup2": str, "COVID_deaths": "float64", "crude_COVID_rate": "float64",
                                  "crude_COVID_rate_ann": "float64", "aa_COVID_rate_ann": "float64", "aa_COVID_rate": "float64"})

#What are the updated column types?
print(covid_frame.dtypes)


jurisdiction_residence            object
group                             object
subgroup1                         object
subgroup2                         object
COVID_deaths                     float64
crude_COVID_rate                 float64
aa_COVID_rate                    float64
crude_COVID_rate_ann             float64
aa_COVID_rate_ann                float64
Date                      datetime64[ns]
dtype: object
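
As an alternative sketch, `pd.to_datetime` can also build dates directly from numeric year, month, and day columns, without making a text column first. The `toy` frame is invented, and the day is assumed to be the first of the month.

```python
import pandas as pd

toy = pd.DataFrame({"year": [2020, 2021], "month": [1, 12]})

# to_datetime accepts a frame with year, month, and day columns;
# day=1 is an assumption, since the data is monthly.
toy["date"] = pd.to_datetime(toy[["year", "month"]].assign(day=1))
print(toy)
```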


Changing text characters in your data. Make all column names lowercase. Lowercase is easier to read in SQL when we get to that point.

In [104]:
# Make all column names lowercase:
covid_frame.columns = map(str.lower, covid_frame.columns)

# Print column names
print(covid_frame.columns)


Index(['jurisdiction_residence', 'group', 'subgroup1', 'subgroup2',
       'covid_deaths', 'crude_covid_rate', 'aa_covid_rate',
       'crude_covid_rate_ann', 'aa_covid_rate_ann', 'date'],
      dtype='object')
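
Column names with spaces are also awkward in SQL, so lowercasing is often combined with replacing spaces. A minimal sketch, assuming some hypothetical messy column names:

```python
import pandas as pd

# Hypothetical column names with spaces and mixed case.
toy = pd.DataFrame(columns=["Observation Date", "COVID_deaths", " aa COVID rate "])

# Trim whitespace, lowercase, and replace inner spaces with underscores.
toy.columns = toy.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
print(list(toy.columns))
```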


Assignment #1 asked you to create a graph and check for outliers. Are there any outliers in your columns? How can we check for outliers?

In [105]:
# Getting summary statistics for the covid_deaths column only, since the rate columns are derived from it
covid_frame.describe()[["covid_deaths"]]

# A few months have much higher counts than the rest due to the nature of the data. Many months have zero or very few deaths,
# which skews the statistics below. Since these are reported deaths, which do not necessarily follow a normal distribution,
# I would not suggest removing these extremes unless we want a smooth distribution or trend for a particular output or graph.


       covid_deaths
count  32632.0
mean   289.119515
min    0.0
25%    0.0
50%    0.0
75%    49.0
max    67990.0
std    1807.508403
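
One common way to check for outliers is the interquartile range (IQR) rule, which flags values more than 1.5 × IQR above the third quartile. This is a sketch on invented values, not the method used above, and flagged values are not necessarily errors.

```python
import pandas as pd

# Hypothetical monthly death counts with one extreme value.
deaths = pd.Series([0, 0, 0, 2, 5, 49, 60, 67990], dtype="float64")

q1 = deaths.quantile(0.25)
q3 = deaths.quantile(0.75)
iqr = q3 - q1

# Values above the upper fence are flagged for review, not removed.
upper_fence = q3 + 1.5 * iqr
flagged = deaths[deaths > upper_fence]
print(upper_fence)
print(flagged)
```

With heavily skewed counts like these, the rule flags the large values by design, so the result is a prompt to look closer rather than a reason to delete rows.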


<b>To create additional steps for data cleaning in Jupyter notebook: </b><br>
Hit the plus button in the top left corner to add a row of code. <br>
To change from code to text or headers, select from the drop down menu above. <br>
Use "< b r >" (without the spaces or quotes) to insert a line break in markdown. Other HTML text formatting tags also work.

Additional step #1:

In [109]:
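# astype(str) turned missing subgroup2 values into the text "nan"; replace those with an empty string.
# Match the whole value (no regex) so that labels which merely contain "nan" are left alone.
covid_frame["subgroup2"] = covid_frame["subgroup2"].replace("nan", "")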

      jurisdiction_residence         group           subgroup1  \
0              United States           Sex              Female   
1              United States           Sex                Male   
2              United States           Age           0-4 years   
3              United States           Age         12-17 years   
4              United States           Age         18-29 years   
...                      ...           ...                 ...   
42763              Region 10  Race and Age  Non-Hispanic White   
42764              Region 10  Race and Age  Non-Hispanic White   
42765              Region 10  Race and Age  Non-Hispanic White   
42766              Region 10  Race and Age  Non-Hispanic White   
42767              Region 10  Race and Age  Non-Hispanic White   

               subgroup2  covid_deaths  crude_covid_rate  aa_covid_rate  \
0                                  3.0               NaN            NaN   
1                                  3.0               NaN 
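
The difference between replacing a whole value and replacing a substring can be seen on a small made-up column. The label "Tanana region" is a hypothetical value chosen only because it contains the letters "nan".

```python
import pandas as pd

values = pd.Series(["nan", "Tanana region", "0-4 years"])

# With regex=True, "nan" is treated as a pattern and removed wherever it appears.
substring_replaced = values.replace("nan", "", regex=True)

# Without regex, only cells that are exactly "nan" are replaced.
exact_replaced = values.replace("nan", "")

# Filling real missing values before any string conversion avoids the issue.
filled = pd.Series(["0-4 years", None]).fillna("")

print(substring_replaced.tolist())
print(exact_replaced.tolist())
print(filled.tolist())
```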

Let's save our new CLEAN data :) 

In [107]:
#Save the newly cleaned dataset as a NEW file:
#covid_frame.to_csv('Provisional_Covid_Deaths_Cleaned.csv', index=False)
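
After saving, a quick round trip can confirm that the file reads back as expected. The sketch below writes to an in-memory buffer instead of a real file so it can run anywhere; the `cleaned` frame is invented for illustration.

```python
import io

import pandas as pd

# Hypothetical cleaned data with a date column and one missing value.
cleaned = pd.DataFrame(
    {
        "jurisdiction_residence": ["United States", "Region 10"],
        "covid_deaths": [3.0, None],
        "date": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    }
)

# Write to an in-memory buffer (a stand-in for a CSV file on disk).
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Dates are stored as text in CSV, so they must be parsed again on load.
reloaded = pd.read_csv(buffer, parse_dates=["date"])

same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)
print(same_shape, same_columns)
print(reloaded.dtypes)
```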