# Real-world Data Wrangling

In this project, you will apply the skills you acquired in the course to gather and wrangle real-world data with two datasets of your choice.

You will retrieve and extract the data, assess the data programmatically and visually across elements of data quality and structure, and implement a cleaning strategy for the data. You will then store the updated data in your selected database/data store, combine the data, and answer a research question with the datasets.

Throughout the process, you are expected to:

1. Explain your decisions towards methods used for gathering, assessing, cleaning, storing, and answering the research question
2. Write code comments so your code is more readable

## 1. Gather data

In this section, you will extract data using two different data gathering methods and combine the data. Use at least two different types of data-gathering methods.

### **1.1.** Problem Statement
In 2-4 sentences, explain the kind of problem you want to look at and the datasets you will be wrangling for this project.

I would like to analyze UFO sightings and how they correlate with birth totals/rates in the United States. To do this, I need birth data from the United States government, available through either the CDC Natality Birth data reports or the CDC Wonder web tool. The UFO data must either come from a dataset previously compiled by someone else or be captured directly from the NUFORC website. I decided to use the CDC Wonder web tool and to scrape the NUFORC website, but I noticed that the granularity of the two sources is uneven: the UFO data goes down to the city level, while the CDC data only goes down to the county level. I will use another data source, or two, to capture FIPS code data so that I can associate each city with the county it resides in. The main source for this data is the Department of Transportation's API.

### **1.2.** Gather at least two datasets using two different data gathering methods

List of data gathering methods:

- Download data manually
- Programmatically downloading files
- Gather data by accessing APIs
- Gather and extract data from HTML files using BeautifulSoup
- Extract data from a SQL database

Each dataset must have at least two variables, and have greater than 500 data samples within each dataset.

For each dataset, briefly describe why you picked the dataset and the gathering method (2-3 full sentences), including the names and significance of the variables in the dataset. Show your work (e.g., if using an API to download the data, please include a snippet of your code). 

Load the dataset programmatically into this notebook.

#### **Dataset 1**: FIPS Data

Type: *API :: CSV Extraction to Dataframe* 

Method: Data was captured through the Department of Transportation's API tool. The API required the SodaPy Socrata Python library to access, authenticate, and download the data. Because each query result is capped by a row limit, the data was downloaded in chunks using a loop and then combined into a single dataframe. Before cleaning, I export the data as pickle and CSV files to keep a local backup of the download.
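The chunked download described above can be sketched as a simple offset loop. This is a minimal illustration, not the exact code used in the project: `fetch_all`, `fake_page`, and the page size are hypothetical names and values, and the sodapy call in the comment uses placeholders for the domain, token, and dataset id.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def fetch_all(fetch_page: Callable[[int, int], List[Row]], page_size: int = 1000) -> List[Row]:
    """Collect every row from a paged source by advancing the offset."""
    rows: List[Row] = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        rows.extend(page)
        # A short (or empty) page means the source has no more rows.
        if len(page) < page_size:
            break
        offset += page_size
    return rows


# With sodapy, fetch_page could wrap the client call (placeholders, not real values):
#   client = Socrata("<portal domain>", "<app token>")
#   fetch_page = lambda limit, offset: client.get("<dataset id>", limit=limit, offset=offset)


def fake_page(limit: int, offset: int) -> List[Row]:
    """Stand-in for the API: serves 2,500 numbered rows in slices."""
    return [{"row": str(i)} for i in range(offset, min(offset + limit, 2500))]


all_rows = fetch_all(fake_page, page_size=1000)
print(len(all_rows))
```

The combined list of records can then be turned into a dataframe, for example with `pd.DataFrame.from_records(all_rows)`.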

**Beginning Dataset variables**:

*   *state_name* :: Text name of the State
*   *county_name* :: Text name of the County
*   *city_name* :: Text name of the City
*   *state_code* :: Two Character abbreviation of the State
*   *state_fipcode* :: FIP Code designation of the state, two digit number
*   *county_code* :: FIP Code assigned to the County, "C" + Three Digit Number 
*   *county_fipcode* :: Full FIP Code designation down to county granularity, this is a five digit code comprised of the two digit state code and the three digit code from the county
*   *city_code* :: FIP Code assigned to the City, this is a four digit number. 
*   *city_fipcode* :: Full FIP Code designation down to city granularity. This is a nine digit code comprised of the five digit fip code above with the four digit city code on the end.
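To illustrate how the five and nine digit codes described in this list are composed, here is a minimal sketch. The function name `build_fips` and the sample values are hypothetical; the zero-padding is an assumption about how short codes should be handled.

```python
from typing import Tuple


def build_fips(state_fipcode: str, county_code: str, city_code: str) -> Tuple[str, str]:
    """Return (five digit county FIPS, nine digit city FIPS) from the parts."""
    state = state_fipcode.zfill(2)             # "6"    -> "06"
    county = county_code.lstrip("C").zfill(3)  # "C037" -> "037"
    city = city_code.zfill(4)                  # "123"  -> "0123"
    fips_five = state + county
    fips_nine = fips_five + city
    return fips_five, fips_nine


# Made-up sample values, used only to show the string layout.
five, nine = build_fips("6", "C037", "123")
print(five, nine)
```

Keeping the codes as strings preserves the leading zeros that an integer column would drop.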



The original data is converted into three datasets that are used during the ETL process. The first is FIPS Data Main, which holds the original source data after its main cleaning and conversion. The second is a table used for vectorized searching and matching of a state name or state code to its respective fipcode. The third is a lookup table for county and city names.

**FIPS Data Main Variables**:
*   *state_name* :: Text name of the State
*   *state_code* :: Two Character abbreviation of the State
*   *state_fipcode* :: FIP Code designation of the state, two digit number
*   *county_name* :: Text name of the County
*   *county_fipcode* :: FIP Code assigned to the County, "C" + Three Digit Number 
*   *city_name* :: Text name of the City
*   *city_fipcode* :: FIP Code assigned to the City, this is a four digit number.
*   *fips_five* :: The five digit fip code designating the state and county.
*   *fips_nine* :: The nine digit fip code designating the state, county, and city.
*   *multi_county_flag* :: A boolean column to indicate if a city exists in multiple counties.
*   *county_count* :: If the city exists in multiple counties, the number of counties will show here, otherwise the number is one. 
*   *county_rank* :: If the city exists in multiple counties, the counties are ranked in ascending order based on the county code.
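The three multi-county columns can be derived with a single grouping pass. The sketch below is one possible approach using only the standard library; it assumes a city is identified by its state code plus city name, and the rows are made-up placeholders.

```python
from collections import defaultdict

# Made-up rows: "Alpha" sits in two counties, "Beta" in one.
rows = [
    {"state_fipcode": "01", "city_name": "Alpha", "county_fipcode": "C003"},
    {"state_fipcode": "01", "city_name": "Alpha", "county_fipcode": "C001"},
    {"state_fipcode": "01", "city_name": "Beta", "county_fipcode": "C005"},
]


def add_multi_county_columns(rows):
    """Add multi_county_flag, county_count, and county_rank to each row."""
    groups = defaultdict(list)
    for row in rows:
        # Assumption: (state, city name) identifies one city.
        groups[(row["state_fipcode"], row["city_name"])].append(row)

    out = []
    for members in groups.values():
        # Rank counties in ascending order of county code.
        ordered = sorted(members, key=lambda r: r["county_fipcode"])
        for rank, row in enumerate(ordered, start=1):
            out.append({
                **row,
                "multi_county_flag": len(ordered) > 1,
                "county_count": len(ordered),
                "county_rank": rank,
            })
    return out


flagged = add_multi_county_columns(rows)
for row in flagged:
    print(row)
```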

**State Lookup Table Variables**:
*   *state_lookup* :: Text name or two-character abbreviation (state code) for the state. 
*   *fips* :: State FIP Code

This table is indexed on the state_lookup column for fast, lightweight matching and searching.
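As a rough illustration of the lookup idea, the sketch below builds a plain dictionary keyed by both the state name and the state code. The helper names are hypothetical, and upper-casing the keys is an assumption about how the matching is normalized.

```python
# Two real state rows used as a small sample.
states = [
    {"state_name": "Alabama", "state_code": "AL", "state_fipcode": "01"},
    {"state_name": "Alaska", "state_code": "AK", "state_fipcode": "02"},
]


def build_state_lookup(states):
    """Map both the state name and the two-character code to the state FIPS code."""
    lookup = {}
    for state in states:
        lookup[state["state_name"].strip().upper()] = state["state_fipcode"]
        lookup[state["state_code"].strip().upper()] = state["state_fipcode"]
    return lookup


def state_to_fips(value, lookup):
    """Return the FIPS code for a name or abbreviation, or None if unknown."""
    return lookup.get(value.strip().upper())


state_lookup = build_state_lookup(states)
print(state_to_fips("alabama", state_lookup), state_to_fips(" AK ", state_lookup))
```

In pandas, the same idea corresponds to setting the index to `state_lookup` and mapping a column against it.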

**County City Lookup Table Variables**:
*   *state_fipcode* :: Contains the two digit fipcode of the state. 
*   *county_lookup* :: Contains the text name of the county. 
*   *county_fipcode* :: Contains the County FIP code, C###. 
*   *city_lookup* :: Contains the city text name. 
*   *city_fipcode* :: Contains the four digit numerical fip code for the city. 

This table works on the same principle as the state lookup table but includes the county and city data as well. It is replaced later in the process with an updated version. The table is also fed into the fuzzy query module used when cleaning the city data.
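The fuzzy query module itself is not shown in this section. As a stand-in, the sketch below uses `difflib.get_close_matches` from the standard library to show how a misspelled city could be matched against the lookup names. The function name, the cutoff of 0.8, and the city names are all assumptions for illustration.

```python
from difflib import get_close_matches


def match_city(raw_city, known_cities, cutoff=0.8):
    """Return the closest known city name, or None if nothing is similar enough."""
    # Compare in lowercase, but return the name as stored in the lookup table.
    normalized = {city.lower(): city for city in known_cities}
    hits = get_close_matches(raw_city.strip().lower(), list(normalized), n=1, cutoff=cutoff)
    return normalized[hits[0]] if hits else None


# Placeholder city names standing in for the city_lookup column.
known = ["Springfield", "Shelbyville", "Capital City"]
print(match_city("Springfeild", known))
print(match_city("Zzz", known))
```

A stricter cutoff reduces false matches at the cost of leaving more user-entered cities unmatched.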

#### **Dataset 2**: CDC Data

Type: *Text Files*

Method: Web scraping/automation with the CDC Wonder web tool using the Python module Selenium. Selenium automated the selection of the web form's options on the official CDC Wonder web tool and triggered a download of the results as text files. The data is pulled from three databases in total. The data from the 1995-2002 database has a different format than the data from the remaining two databases, which cover 2003 to 2023.

Dataset variables:

**1995-2002 Dataset:** 

*   *Notes* :: Column was used to designate special rows or changes in the data structure. 
*   *Year* :: Column contained the integer/text four digit year value. 
*   *Year Code* :: Column contained the Codified version of the year value, which was the same four digit year as before.
*   *State* :: Text name of the State 
*   *State Code* :: 2 digit FIP code designation for the State
*   *County* :: Text name of the County 
*   *County Code* :: Five digit FIPS code designation for the State and County. 
*   *Births* :: Total number of births for that row entry, aggregated by year, state, and county.

Although these two were not necessarily variables, they were a part of the original structure of the data. 
*   *Footer* :: At the end of each data table, a footer of text provided important information regarding changes, alterations, and updates to the dataset's details. 
*   *Totals* :: These rows appeared throughout the dataset, typically where a section of data had concluded, such as the end of a state's counties, and provided an aggregated total of that group's births. A total row occurred at the end of each level of grouping: state, year, and overall.

**2003-2023 Dataset:** 

*   *Notes* :: Column was used to designate special rows or changes in the data structure. 
*   *Year* :: Column contained the integer/text four digit year value. 
*   *Year Code* :: Column contained the Codified version of the year value, which was the same four digit year as before.
*   *Month* :: Text name of the Month
*   *Month Code* :: Column contained a codified version of the month's value, which would be the numerical representation of the month. 
*   *State* :: Text name of the State 
*   *State Code* :: 2 digit FIP code designation for the State
*   *County* :: Text name of the County 
*   *County Code* :: Five digit FIPS code designation for the State and County. 
*   *Births* :: Total number of births for that row entry, aggregated by year, month, state, and county.

Although these two were not necessarily variables, they were a part of the original structure of the data. 
*   *Footer* :: At the end of each data table, a footer of text provided important information regarding changes, alterations, and updates to the dataset's details. 
*   *Totals* :: These rows appeared throughout the dataset, typically where a section of data had concluded, such as the end of a state's counties, and provided an aggregated total of that group's births. A total row occurred at the end of each level of grouping: state, month, year, and overall. These totals differ from the ones above because the data granularity now includes months.

**For all databases:** By the end of the cleaning process, the CDC data will contain the following variables: 

*   *year_code* :: Column contained the Codified version of the year value, which was the same four digit year as before.
*   *month_code* :: Column contained a codified version of the month's value, which would be the numerical representation of the month.  
*   *state_fipcode* :: 2 digit FIP code designation for the State
*   *county_fipcode* :: The three digit county FIP Code, C###. 
*   *fips_five* :: Five digit FIPS code designating the state and county.
*   *Births* :: Total number of births that occurred during that month for that year, state, and county. 

The footers that were originally a part of the data are cut off the table data and saved to their own folder inside the documents directory, /docs/cdc_data_footers.
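The sketch below shows one way the table, total rows, and footer could be separated, and how the final CDC variables could be derived from the five digit county code. It assumes the export is tab-delimited, that the footer starts at the first line containing only `---`, and that total rows carry text in the *Notes* column. The sample text is a made-up placeholder, not real CDC output.

```python
import csv
import io

# Made-up sample in the assumed export layout (not real CDC values).
SAMPLE = (
    '"Notes"\t"Year"\t"Year Code"\t"Month"\t"Month Code"\t"State"\t"State Code"'
    '\t"County"\t"County Code"\tBirths\n'
    '\t"2003"\t"2003"\t"January"\t"1"\t"Alabama"\t"01"\t"Example County, AL"\t"01999"\t10\n'
    '"Total"\t"2003"\t"2003"\t\t\t\t\t\t\t10\n'
    '"---"\n'
    '"Dataset: placeholder footer text"\n'
)


def split_export(text):
    """Split raw export text into (table_text, footer_text) at the first '---' line."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().strip('"') == "---":
            return "\n".join(lines[:i]), "\n".join(lines[i + 1:])
    return text, ""


def parse_births(table_text):
    """Keep only data rows and derive the cleaned CDC variables."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    records = []
    for row in reader:
        if row["Notes"]:  # total rows are flagged in the Notes column
            continue
        county_code = row["County Code"]
        records.append({
            "year_code": int(row["Year Code"]),
            # The 1995-2002 format has no month column.
            "month_code": int(row["Month Code"]) if row.get("Month Code") else None,
            "state_fipcode": county_code[:2],
            "county_fipcode": "C" + county_code[2:],
            "fips_five": county_code,
            "Births": int(row["Births"]),
        })
    return records


table, footer = split_export(SAMPLE)
records = parse_births(table)
print(records)
print(footer)
```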



#### **Dataset 3**: UFO Data

Type: *CSV file*

Method: Data was captured through web scraping with two libraries, Beautiful Soup and Requests. The UFO sightings data was scraped from the NUFORC website: Beautiful Soup captured the session token, and the Requests library then took over, sending requests to the web server's backend to obtain results in batches. The batches were saved into a dataframe and exported every 10,000 items to avoid creating a file too large for the system to save.
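The batched export can be sketched as a buffer that is flushed to a new CSV file each time it reaches the batch size. This is an illustration only: `export_in_batches`, the file prefix, and the fake rows are hypothetical, and the demo uses a batch size of 10 instead of 10,000 to stay small.

```python
import csv
import tempfile
from pathlib import Path


def export_in_batches(rows, out_dir, batch_size=10_000, prefix="ufo_batch"):
    """Write rows to numbered CSV files, one file per full (or final partial) batch."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    buffer = []

    def flush():
        path = out_dir / f"{prefix}_{len(written) + 1:04d}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(buffer[0].keys()))
            writer.writeheader()
            writer.writerows(buffer)
        written.append(path)
        buffer.clear()

    for row in rows:
        buffer.append(row)
        if len(buffer) >= batch_size:
            flush()
    if buffer:  # write whatever is left after the last full batch
        flush()
    return written


with tempfile.TemporaryDirectory() as tmp:
    # Fake rows standing in for scraped sightings.
    fake_rows = ({"report_id": i, "city": "placeholder"} for i in range(25))
    files = export_in_batches(fake_rows, tmp, batch_size=10)
    file_names = [p.name for p in files]
    print(file_names)
```

Because `rows` can be a generator, only one batch is held in memory at a time.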

Dataset variables:

*   *report_id* :: This column was originally a localized/partial link to the report's details page. I stripped the link text leaving only the numerical unique id value for the report. 
*   *occurred_on* :: This column allowed the user to input a date and time into the field regarding when they observed the sighting. 
*   *city* :: This column allowed the user to input text indicating the city the sighting was observed in.
*   *state* :: This column allowed the user to input text indicating the state the sighting was observed in. 
*   *country* :: This column allowed the user to input text indicating the country the sighting was observed in. 
*   *shape* :: This column allowed the user to input text to describe the shape of the sighting.
*   *summary* :: This column allowed the user to input text providing a summary of the sighting.
*   *reported_on* :: This column appears to be a server side timestamp generated upon the incident being saved into the database by the user. 
*   *has_media* :: This column was a boolean value indicating whether media (photo, video, etc.) was uploaded with the sighting report.
*   *explanation* :: This column allowed the user to input text providing a possible explanation for the sighting
*   *report_link* :: This column was not originally a part of the web server's data. However, upon export of the data, I took the partial/localized link that was originally in the report id column and generated a direct link to the report's incident page.
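The `report_id` and `report_link` handling described in this list can be sketched as follows. The base URL is a placeholder and the `?id=` link pattern is an assumption; the real site's link format may differ, and `split_report_link` is a hypothetical helper name.

```python
import re
from urllib.parse import urljoin

# Placeholder base URL, not the real site address.
BASE_URL = "https://example.org/"


def split_report_link(partial_link):
    """Return (numeric report id, absolute report link) from a partial link."""
    # Assumption: the id appears as an "id=<digits>" query parameter.
    match = re.search(r"id=(\d+)", partial_link)
    if match is None:
        raise ValueError(f"No report id found in {partial_link!r}")
    report_id = int(match.group(1))
    return report_id, urljoin(BASE_URL, partial_link)


report_id, report_link = split_report_link("/sighting/?id=12345")
print(report_id, report_link)
```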

**By the end** of the dataset cleaning process, the dataset will have the following variables: 

*   *report_id* :: The original report_id generated during the extraction process. 
*   *year_code* :: The codified version of the year, four digit year value. 
*   *month_code* :: The codified version of the month, numerical representation of the month. 
*   *state_fipcode* :: The two digit FIP code for the state
*   *county_fipcode* :: The three digit FIP code for the county, C###. 
*   *city_fipcode* :: The four digit FIP code for the city. 
*   *fips_five* :: The five digit FIP code designating the state and county. 
*   *fips_nine* :: The nine digit FIP code designating the state, county, and city.

The remaining data that was a part of the original extraction was removed and archived during the cleanup process. I made sure to include the report id with the other columns' data that was split off from this data (shape, summary, etc.) so that the records can be linked together easily.

Optional data storing step: You may save your raw dataset files to the local data store before moving to the next step.

Various methods utilized. 

The finalized cleaned data is stored in a SQL database, as well as in CSV and Pickle format inside the cleaned data directory. 

During the data extraction phase of the project, the data is downloaded and saved as local files in each source's own folder inside the /data/raw directory. As mentioned above, the CDC data downloads natively as a text file. The FIPS data is extracted and stored in a dataframe in memory, but to ensure that data loss does not occur, I export it as a pickle and CSV file as a checkpoint. The UFO data is scraped and saved in batches to CSV files as it downloads.

Once the cleaning process begins, a copy of each source's data is transferred to its folder in the processing directory, /data/processed. The raw data that was downloaded/extracted is then archived through a zip utility and moved into its respective folder inside the archive directory, /data/archived. The raw folder is then emptied. From this point forward, all cleaning and processing output is saved into its source folder inside the processing directory, /data/processed. Throughout the cleaning process, checkpoints are created to serve as restore points in the data. When this occurs, the data in the processing folder is typically archived, moved to the archive folder, and removed from the processing folder. There are some exceptions to this archiving, such as when the data is a part of a larger cleaning/transformation process, in which case the archive is held until the end to group like items. When the archive does occur, the archived data is replaced with a new export of the working data, thus creating a new checkpoint. Almost all checkpoints are a CSV and pickle file export.
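The archive-then-empty step can be sketched with the standard library's `shutil.make_archive`. The helper name `archive_and_empty`, the archive name, and the sample file are hypothetical, and the demo runs inside a temporary directory rather than the real /data tree.

```python
import shutil
import tempfile
from pathlib import Path


def archive_and_empty(source_dir, archive_dir, name):
    """Zip everything in source_dir into archive_dir/name.zip, then empty source_dir."""
    source_dir, archive_dir = Path(source_dir), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    zip_path = shutil.make_archive(str(archive_dir / name), "zip", root_dir=str(source_dir))
    # Remove the archived contents but keep the (now empty) source folder.
    for child in source_dir.iterdir():
        if child.is_dir():
            shutil.rmtree(child)
        else:
            child.unlink()
    return Path(zip_path)


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "data" / "raw" / "ufo"
    raw.mkdir(parents=True)
    (raw / "batch_0001.csv").write_text("report_id\n1\n", encoding="utf-8")

    zip_path = archive_and_empty(raw, Path(tmp) / "data" / "archived" / "ufo", "ufo_raw_checkpoint")
    archive_created = zip_path.exists()
    raw_is_empty = not any(raw.iterdir())
    print(zip_path.name, archive_created, raw_is_empty)
```

Creating the archive before deleting anything means a failed zip leaves the source files untouched.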

Once all cleaning has been completed, the data is uploaded into a database file, which can be found in the database folder, /data/db. All processed data is archived to the archive folder and the processed data folders are emptied. Then a copy of the uploaded data is read from the database into a dataframe and exported as a checkpoint, a CSV and pickle file, into its source folder in the cleaned data directory, /data/cleaned.
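As a small illustration of the upload and read-back step, the sketch below uses SQLite through the standard library. SQLite is an assumption here, since the section only says "a database file"; the table name `cdc_births`, the helper `save_births`, and the sample record are hypothetical.

```python
import sqlite3


def save_births(conn, records):
    """Create the table if needed and insert the cleaned records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cdc_births ("
        "year_code INTEGER, month_code INTEGER, fips_five TEXT, births INTEGER)"
    )
    conn.executemany(
        "INSERT INTO cdc_births VALUES (:year_code, :month_code, :fips_five, :births)",
        records,
    )
    conn.commit()


# In-memory database for the demo; a real run would point at a file under /data/db.
conn = sqlite3.connect(":memory:")
save_births(conn, [{"year_code": 2003, "month_code": 1, "fips_five": "01999", "births": 10}])

# Read the uploaded data back out, as done before exporting the final checkpoint.
exported = conn.execute(
    "SELECT year_code, month_code, fips_five, births FROM cdc_births"
).fetchall()
conn.close()
print(exported)
```

Storing `fips_five` as TEXT keeps the leading zero that an INTEGER column would drop.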

## 2. Assess data

Assess the data according to data quality and tidiness metrics using the report below.

List **two** data quality issues and **two** tidiness issues. Assess each data issue visually **and** programmatically, then briefly describe the issue you find.  **Make sure you include justifications for the methods you use for the assessment.**

### Quality Issue 1:

In [None]:
#FILL IN - Inspecting the dataframe visually

In [None]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

### Quality Issue 2:

In [None]:
#FILL IN - Inspecting the dataframe visually

In [None]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

### Tidiness Issue 1:

In [None]:
#FILL IN - Inspecting the dataframe visually

In [None]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

### Tidiness Issue 2: 

In [None]:
#FILL IN - Inspecting the dataframe visually

In [None]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

## 3. Clean data
Clean the data to solve the 4 issues corresponding to data quality and tidiness found in the assessing step. **Make sure you include justifications for your cleaning decisions.**

After the cleaning for each issue, please use **either** the visual or programmatic method to validate the cleaning was successful.

At this stage, you are also expected to remove variables that are unnecessary for your analysis and combine your datasets. Depending on your datasets, you may choose to perform variable combination and elimination before or after the cleaning stage. Your dataset must have **at least** 4 variables after combining the data.

In [None]:
# FILL IN - Make copies of the datasets to ensure the raw dataframes 
# are not impacted

### **Quality Issue 1: FILL IN**

In [None]:
# FILL IN - Apply the cleaning strategy

In [None]:
# FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Quality Issue 2: FILL IN**

In [None]:
#FILL IN - Apply the cleaning strategy

In [None]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Tidiness Issue 1: FILL IN**

In [None]:
#FILL IN - Apply the cleaning strategy

In [None]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Tidiness Issue 2: FILL IN**

In [None]:
#FILL IN - Apply the cleaning strategy

In [3]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Remove unnecessary variables and combine datasets**

Depending on the datasets, you can also perform the combination before the cleaning steps.

In [4]:
#FILL IN - Remove unnecessary variables and combine datasets

## 4. Update your data store
Update your local database/data store with the cleaned data, following best practices for storing your cleaned data:

- Must maintain different instances / versions of data (raw and cleaned data)
- Must name the dataset files informatively
- Ensure both the raw and cleaned data is saved to your database/data store

In [5]:
#FILL IN - saving data

## 5. Answer the research question

### **5.1:** Define and answer the research question 
Going back to the problem statement in step 1, use the cleaned data to answer the question you raised. Produce **at least** two visualizations using the cleaned data and explain how they help you answer the question.

*Research question:* Is there a correlation between UFO Sightings and Births within the United States?

In [6]:
#Visual 1 - FILL IN

*Answer to research question:* FILL IN

In [7]:
#Visual 2 - FILL IN

*Answer to research question:* FILL IN

### **5.2:** Reflection
In 2-4 sentences, if you had more time to complete the project, what actions would you take? For example, which data quality and structural issues would you look into further, and what research questions would you further explore?

*Answer:* FILL IN