# Order Of Operations

### **IMPORTANT / READ FIRST**: 

#### How to: 

Starting at Step 1: 

To run each of these notebooks, please do one of the following:

- Use the "Run All" option under the Run menu in the toolbar.
- Open the notebook, select its last cell, and then choose the "Run All Above Selected Cell" option from the Run menu in the toolbar.
- Open the notebook, select the first cell, and then run each cell individually in order. This is best done with the keyboard shortcut "Shift + Enter", which runs the selected cell and then selects the next cell in the notebook.

At the end of every notebook, there is a markdown cell that links to the notebook for the next step, or you may use the link there to come back to this landing page.

# **DO NOT RUN OUT OF ORDER** 

### Step 0: Checking for Required "folder_paths.json" file

```python
# Import OS
import os

file_path = "../data/folder_paths.json"  # Replace with the actual file path

if os.path.exists(file_path):
    print("File exists")
    print("Proceed with Step 1")
else:
    print("File does not exist")
    print("Running folder_manager.py file from d497_helpers module")
    %run "../d497_helpers/folder_manager.py"
    print("Proceed with Step 1")
```

```
File exists
Proceed with Step 1
```


### Step 1: [Data Extraction - FIPS Data](data_extraction_fips.ipynb) 

FIPS data is obtained from the United States Department of Transportation through its API. The data is captured into a dataframe and then exported as a CSV file.

### Step 2: [Data Extraction - UFO Data](data_extraction_ufo.ipynb)

UFO sighting data is captured from the NUFORC website using two web scraping tools, Beautiful Soup and Requests. Beautiful Soup captures the session token from the website, and Requests then makes multiple calls to the website's backend in a loop to capture the data. The data is captured into a dataframe and then exported as a CSV file.
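The token-then-loop pattern described above can be sketched offline. This is only an illustration under stated assumptions: the hidden field name `session_token`, the page-numbered backend, and the `fetch_page` callable are all hypothetical, and the standard library's `html.parser` stands in for Beautiful Soup while an injected function stands in for Requests, so the sketch runs without network access.

```python
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Find the value of a hidden <input> with a given name (hypothetical field)."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == self.field_name:
            self.token = attributes.get("value")


def extract_token(html, field_name="session_token"):
    """Return the token embedded in the page, or None if it is absent."""
    finder = TokenFinder(field_name)
    finder.feed(html)
    return finder.token


def collect_rows(fetch_page, token):
    """Call fetch_page(token, page) repeatedly until it returns no rows."""
    rows = []
    page = 0
    while True:
        batch = fetch_page(token, page)
        if not batch:
            break
        rows.extend(batch)
        page += 1
    return rows


# Offline demonstration with a fake page and a fake backend (both assumptions).
sample_html = '<form><input type="hidden" name="session_token" value="abc123"></form>'
fake_backend = {0: [{"id": 1}, {"id": 2}], 1: [{"id": 3}]}

token = extract_token(sample_html)
rows = collect_rows(lambda tok, page: fake_backend.get(page, []), token)
print(token, len(rows))
```

In the notebook itself, the fake backend would be replaced by real HTTP calls that send the token with each request.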

### Step 3: [Data Extraction - CDC Data](data_extraction_cdc.ipynb)

CDC data is captured from the CDC WONDER web portal using the web automation tool Selenium, which automates the generation of text file downloads.

### Step 4: [Data Cleaning - FIPS Data](data_cleaning_fips_main.ipynb)

FIPS data is prepared for loading into the SQL database here. The data is checked for null values and for U.S. states, and transformed to fit a more unified format for this project.

### Step 5: [Data Load - FIPS Data](data_load_fips.ipynb)

The FIPS data is loaded into a SQL database here. 

### Step 6: [Data Transformation - State Lookup Data](data_transformation_state_lookup.ipynb)

The State Lookup table is generated here. FIPS data from the SQL database is exported and transformed into a table that will be utilized for vectorized searching, which allows faster and more efficient matching of a state name or state abbreviation to the state's FIPS code.
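As a rough illustration of the lookup idea, the sketch below maps both state names and abbreviations to a FIPS code. It uses a plain dictionary as a stand-in for the vectorized dataframe lookup used in the notebook, and the three-row table and function names are hypothetical.

```python
# Hypothetical slice of the State Lookup table: (state name, abbreviation, FIPS code).
STATE_ROWS = [
    ("Alabama", "AL", "01"),
    ("California", "CA", "06"),
    ("Texas", "TX", "48"),
]


def build_state_lookup(rows):
    """Index every state by both its name and its abbreviation."""
    lookup = {}
    for name, abbreviation, fips in rows:
        lookup[name.casefold()] = fips
        lookup[abbreviation.casefold()] = fips
    return lookup


def match_states(values, lookup):
    """Return the FIPS code for each value, or None when nothing matches."""
    return [lookup.get(value.strip().casefold()) for value in values]


state_lookup = build_state_lookup(STATE_ROWS)
matched = match_states(["TX", " california ", "Atlantis"], state_lookup)
print(matched)  # ['48', '06', None]
```

Keeping names and abbreviations in one index means a single lookup handles either form of user input.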

### Step 7: [Data Transformation - County City Lookup Data](data_transformation_city_lookup.ipynb)

The County and City Lookup table is made here. The method is the same as for the State Lookup table above, but this time the focus is on creating a table that allows city names to be matched to their county.

### Step 8: [Data Cleaning - CDC Data](data_cleaning_cdc_main.ipynb)

The CDC data is cleaned here in preparation for import into the SQL database.

The CDC data undergoes several main cleanup operations. First, the informational footer is split off from the original text files. Next, the 'total' rows are removed from each file. The third main change applies to the years 1995 to 2002: the database that contains this information stored the data as yearly totals rather than monthly totals. To correct this lack of uniformity, the monthly data is imputed by splitting each yearly total evenly across the months.

Once the imputing is finished, the data is recombined and converted to a more uniform format.
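The even split of a yearly total into monthly values can be sketched as follows. This is a minimal illustration that assumes fractional monthly values are acceptable; the function name and the sample total are hypothetical, and the notebook may round or distribute remainders differently.

```python
def split_year_into_months(year, yearly_total):
    """Spread one yearly total evenly across twelve monthly rows."""
    monthly_share = yearly_total / 12
    return [(year, month, monthly_share) for month in range(1, 13)]


# Hypothetical yearly total for one county.
monthly_rows = split_year_into_months(1995, 1200)
print(monthly_rows[0])   # (1995, 1, 100.0)
print(len(monthly_rows))  # 12
```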

### Step 9: [Data Load - CDC Data](data_load_cdc.ipynb)

CDC Data is loaded into the SQL database here. 

### Step 10: [Data Transformation - Removal of Unspecified Counties from CDC Data](data_transformation_cdc_data_unspecified_county_removal.ipynb)

The initial CDC data contained records for births that were not linked to a specific county; these were originally stored as Unspecified Counties. This notebook removes them by evenly distributing their data across each county in the state. The data is exported from the database, transformed, and imported back into the database as a new table.
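A minimal sketch of the even redistribution, for one state and one time period, might look like this. The county names, counts, and function name are hypothetical, and fractional results are assumed to be acceptable.

```python
def redistribute_unspecified(county_births, unspecified_births):
    """Add an equal share of the unspecified births to every named county."""
    share = unspecified_births / len(county_births)
    return {county: births + share for county, births in county_births.items()}


# Hypothetical counts for a single state in a single month.
state_counties = {"County A": 100, "County B": 50, "County C": 10}
adjusted = redistribute_unspecified(state_counties, 30)
print(adjusted)  # {'County A': 110.0, 'County B': 60.0, 'County C': 20.0}
```

The state total is preserved: the 160 specified births plus the 30 unspecified births equal the 190 births in the adjusted table.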

### Step 11: [Data Cleaning - UFO](data_cleaning_ufo_main.ipynb)

The UFO data cleaning is performed here. This notebook is extensive. 

The data is stripped of any additional whitespace, new lines, etc. The data is then split into the columns that will be utilized for this report and those that will not be. To avoid needing to perform the extraction again, the data that is not utilized is zipped and archived for potential future use. Once the data has been stripped and split, it is combined from its multiple files into a single file.
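The strip-and-split step can be illustrated with a small sketch. The column names and the `USED_COLUMNS` set are hypothetical, and plain dictionaries stand in for the dataframes used in the notebook.

```python
# Hypothetical set of columns kept for the report.
USED_COLUMNS = {"city", "state", "country"}


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and new lines into single spaces."""
    return " ".join(text.split())


def split_row(row, used_columns=USED_COLUMNS):
    """Clean every value, then separate the used columns from the unused ones."""
    cleaned = {key: normalize_whitespace(value) for key, value in row.items()}
    used = {key: value for key, value in cleaned.items() if key in used_columns}
    unused = {key: value for key, value in cleaned.items() if key not in used_columns}
    return used, unused


raw_row = {"city": "  Phoenix \n", "state": "AZ", "country": "USA", "summary": "bright\n  light"}
used, unused = split_row(raw_row)
print(used)    # {'city': 'Phoenix', 'state': 'AZ', 'country': 'USA'}
print(unused)  # {'summary': 'bright light'}
```

The `unused` portion corresponds to the data that would be zipped and archived rather than discarded.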

The data is then transformed over a series of steps into a more unified format to be used for analysis. The majority of these transformations occur when working with the city data. The CDC data only goes down to county-level granularity, while the UFO data allowed users to input City, State, and Country, none of which was restricted on entry. Due to these differences in granularity and strictness of data accuracy, multiple passes are necessary to clean up the data and match it to valid locations:

1. Non-United States countries are removed.
2. The states are matched using the State Lookup table made previously.
3. The values in the city column are cleaned so they can be matched to the county where they reside, which equalizes the granularity of the data. Rows whose city value contains special characters are set aside at this point.
4. The dataset is merged with the County City Lookup table created before, and duplicates are removed for entries that have multiple counties for each city.
5. Non-matched entries are checked row by row against the County and City Lookup table.
6. Rows that are still unmatched are run through a third pass that matches the row value against the County and City Lookup table using a fuzzy search tool called RapidFuzz.
7. The remaining non-matched entries are manually reviewed and either updated or filtered out of the database due to invalid data.
8. After the main data has been matched, the special character data set aside previously is run through the same fuzzy match tool to attempt to match the entries against the County and City Lookup table.

Finally, the data is joined back together.
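The exact-match-then-fuzzy-match fallback can be sketched as below. This is an illustration only: the standard library's `difflib.get_close_matches` is swapped in for RapidFuzz so the example has no third-party dependency, the four-city table is a hypothetical slice of the County and City Lookup table, the 0.8 cutoff is an arbitrary assumption, and the state is ignored for brevity.

```python
from difflib import get_close_matches

# Hypothetical slice of the County and City Lookup table: city -> county.
CITY_TO_COUNTY = {
    "phoenix": "Maricopa",
    "los angeles": "Los Angeles",
    "houston": "Harris",
    "seattle": "King",
}


def match_county(city, lookup=CITY_TO_COUNTY, cutoff=0.8):
    """Try an exact match first, then fall back to the closest fuzzy match."""
    key = city.strip().casefold()
    if key in lookup:
        return lookup[key]
    candidates = get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[candidates[0]] if candidates else None


print(match_county("Seattle"))      # King (exact match)
print(match_county("Los Angelas"))  # Los Angeles (fuzzy match on a misspelling)
print(match_county("Zzzzzz"))       # None (left for manual review)
```

Running the cheap exact match first keeps the slower fuzzy comparison for the small set of rows that actually need it.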

### Step 12: [Data Load - UFO](data_load_ufo.ipynb)

UFO Data is loaded into the database here. 

### Step 13: [Data Transformation - UFO Data Aggregation](data_transformation_ufo_data_aggregation.ipynb)

The UFO data is exported and aggregated from single-entry occurrences into grouped totals for each year, month, state, and county, matching the structure of the CDC data.
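The aggregation from individual sightings to grouped totals can be sketched with `collections.Counter`. The sample records and field names are hypothetical, and the notebook most likely performs the equivalent grouping on a dataframe.

```python
from collections import Counter

# Hypothetical single-entry sighting records.
sightings = [
    {"year": 2001, "month": 7, "state": "AZ", "county": "Maricopa"},
    {"year": 2001, "month": 7, "state": "AZ", "county": "Maricopa"},
    {"year": 2001, "month": 8, "state": "WA", "county": "King"},
]


def aggregate_sightings(records):
    """Count sightings per (year, month, state, county) group."""
    counts = Counter(
        (record["year"], record["month"], record["state"], record["county"])
        for record in records
    )
    return [
        {"year": year, "month": month, "state": state, "county": county, "sightings": total}
        for (year, month, state, county), total in sorted(counts.items())
    ]


totals = aggregate_sightings(sightings)
print(totals[0])  # {'year': 2001, 'month': 7, 'state': 'AZ', 'county': 'Maricopa', 'sightings': 2}
```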

### Step 14: [Data Analysis](data_analysis_main.ipynb)

Data analysis occurs here. Please run all the cells and then go to the project starter notebook.


### Step 15: [Data Wrangling Project Starter Notebook](data_wrangling_project_starter.ipynb)

Answers all the academically required questions and presents the responses from my analysis.