# Introduction: Data Merging

This Jupyter Notebook outlines a data processing pipeline that merges three datasets: ORES quality predictions for Wikipedia articles, a list of US city articles by state, and the US Census Bureau's assignment of states to regional divisions. The goal is a single dataset that records, for each article, its state, regional division, title, last revision ID, and ORES quality prediction.


In [None]:
# Import necessary libraries
import pandas as pd

# Step 1: Reading the datasets
In this step, we read three separate datasets: 'cleaned_data.csv', 'us_cities_by_state_SEPT.2023.csv', and 'US States by Region - US Census Bureau.xlsx'. These datasets contain crucial information about ORES predictions, US city details, and US state divisions, respectively.


In [None]:
## Reading the 'cleaned_data.csv' file (ORES predictions)
ores_df = pd.read_csv('../data/cleaned_data.csv')

## Reading the 'us_cities_by_state_SEPT.2023.csv' file
cities_df = pd.read_csv('../data/us_cities_by_state_SEPT.2023.csv')

## Reading the 'US States by Region - US Census Bureau.xlsx' file
regions_df = pd.read_excel('../data/US States by Region - US Census Bureau.xlsx')
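
Because the later steps rely on specific column names ('Title', 'page_title', 'State', 'STATE'), it can help to confirm they exist right after loading. The following is a minimal sketch, not part of the original pipeline: `missing_columns` is a hypothetical helper, and the toy dataframe stands in for the real files, whose exact contents are not shown here.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


# Toy stand-in for cities_df (invented row); note the lowercase 'state' column.
cities_demo = pd.DataFrame({"page_title": ["Nome, Alaska"], "state": ["Alaska"]})

# The check is case-sensitive, so 'State' is reported as missing here.
missing = missing_columns(cities_demo, ["page_title", "State"])
print(missing)
```

Running the same check on `ores_df`, `cities_df`, and `regions_df` would surface a renamed or misspelled column before it causes a `KeyError` in a later merge.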



# Step 2: Data Preprocessing
This step prepares the data for the regional merge. We keep only the 'page_title' and 'State' columns from the 'us_cities_by_state_SEPT.2023.csv' dataset, then merge 'ores_df' with 'cities_df' by matching 'Title' against 'page_title'. Because this is an inner join, articles that appear in only one of the two datasets are dropped.


In [None]:
## Keeping only the 'page_title' and 'State' columns from the cities dataframe
cities_df = cities_df[['page_title', 'State']]

## Merging the 'ores_df' and 'cities_df' on the 'Title' and 'page_title' columns
merged_df = pd.merge(ores_df, cities_df, left_on='Title', right_on='page_title', how='inner')
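
An inner join drops unmatched rows silently, so it can be useful to count them first. The sketch below is an illustration under assumptions: the two toy dataframes are invented stand-ins for `ores_df` and `cities_df`, and the audit uses an outer join with pandas' `indicator=True` option, which adds a `_merge` column marking where each row came from.

```python
import pandas as pd

# Toy stand-ins for ores_df and cities_df; the rows are invented for illustration.
ores_demo = pd.DataFrame({
    "Title": ["Abbeville, Alabama", "Adamsville, Alabama", "Nome, Alaska"],
    "Prediction": ["C", "Stub", "B"],
})
cities_demo = pd.DataFrame({
    "page_title": ["Abbeville, Alabama", "Nome, Alaska", "Juneau, Alaska"],
    "State": ["Alabama", "Alaska", "Alaska"],
})

# Outer join with an indicator column: 'both', 'left_only', or 'right_only'.
audit = pd.merge(
    ores_demo,
    cities_demo,
    left_on="Title",
    right_on="page_title",
    how="outer",
    indicator=True,
)

# How many rows matched, and which ORES titles have no city record.
match_counts = audit["_merge"].value_counts()
unmatched_titles = audit.loc[audit["_merge"] == "left_only", "Title"].tolist()
print(match_counts)
print(unmatched_titles)
```

Only the rows marked `both` survive the inner join used in this step; the others are the ones it discards.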



# Step 3: Merging the Dataframes
Here, we merge the previously combined dataframe with the 'regions_df' dataframe, matching the 'State' column of 'merged_df' against the 'STATE' column of 'regions_df'. The result is a single dataframe that draws on all three initial datasets: each article row now carries its state, that state's regional division, and the article's ORES details.


In [None]:
## Merging 'merged_df' and 'regions_df' by matching 'State' against 'STATE'
final_df = pd.merge(merged_df, regions_df, left_on='State', right_on='STATE', how='inner')
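
State names that differ only by stray whitespace will not match, and a state listed twice in the regions table would duplicate article rows. The following sketch shows one way to guard against both, assuming toy data in place of `merged_df` and `regions_df`: it strips whitespace from the key columns and passes `validate="many_to_one"`, which makes pandas raise a `MergeError` if the right-hand keys are not unique.

```python
import pandas as pd

# Toy stand-ins for merged_df and regions_df; the rows are invented for illustration.
merged_demo = pd.DataFrame({
    "State": ["Alabama ", "Alaska", "Alaska"],  # note the trailing space
    "Title": ["Abbeville, Alabama", "Nome, Alaska", "Juneau, Alaska"],
})
regions_demo = pd.DataFrame({
    "STATE": ["Alabama", "Alaska"],
    "DIVISION": ["East South Central", "Pacific"],
})

# Normalize the join keys so that 'Alabama ' and 'Alabama' compare equal.
merged_demo["State"] = merged_demo["State"].str.strip()
regions_demo["STATE"] = regions_demo["STATE"].str.strip()

# validate="many_to_one" asserts that each state appears once in regions_demo.
final_demo = pd.merge(
    merged_demo,
    regions_demo,
    left_on="State",
    right_on="STATE",
    how="inner",
    validate="many_to_one",
)
print(final_demo[["State", "DIVISION", "Title"]])
```

Whether the real files need this normalization is an assumption; comparing `len(merged_df)` with `len(final_df)` after the merge is a quick way to find out.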



# Step 4: Selecting the Required Columns
To streamline the dataset, we keep only the columns needed for further analysis: 'State', 'DIVISION', 'article_title', 'Last_Revision_ID', and 'Prediction'. All other columns from the merges are dropped.


In [None]:
## Selecting the necessary columns for the final dataset
final_df = final_df[['State', 'DIVISION', 'article_title', 'Last_Revision_ID', 'Prediction']]



# Step 5: Renaming the Columns
To improve the readability of the final dataset, we rename the columns to consistent, descriptive lowercase names.


In [None]:
## Renaming the columns for better readability
final_df = final_df.rename(columns={'State': 'state', 'DIVISION': 'regional_division', 'Last_Revision_ID': 'revision_id', 'Prediction': 'article_quality'})



# Step 6: Saving the Final Dataset
Finally, we save the resulting dataset to a CSV file named 'resulting_data.csv'. This file contains the state, regional division, article title, revision ID, and ORES article quality prediction for each article retained by the merges.


In [None]:
## Saving the resulting data to a CSV file
final_df.to_csv('../data/resulting_data.csv', index=False)

# Display the first few rows of the final merged dataset
final_df.head()
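
As an optional sanity check, the saved file can be read back to confirm the column names and row count survived the round trip. This is a sketch under assumptions: the toy dataframe and its values are invented, and an in-memory buffer replaces '../data/resulting_data.csv' so the example runs without touching the disk.

```python
import io

import pandas as pd

# Toy stand-in for final_df; the values are invented for illustration.
final_demo = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "regional_division": ["East South Central", "Pacific"],
    "article_title": ["Abbeville, Alabama", "Nome, Alaska"],
    "revision_id": [1001, 1002],
    "article_quality": ["C", "B"],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
final_demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded frame should have the same columns and number of rows.
same_columns = reloaded.columns.tolist() == final_demo.columns.tolist()
same_length = len(reloaded) == len(final_demo)
print(same_columns, same_length)
```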
