# Analyzing US Population and Wikipedia Articles for Cities
This Jupyter notebook combines state population figures with quality ratings of Wikipedia articles about US cities to compare article coverage per capita across states and census divisions.

## Load Data from CSV

In [42]:
# Import necessary libraries and load the data
import pandas as pd
fp = 'data/wp_scored_city_articles_by_state.csv'
data = pd.read_csv(fp)

data.head()

Unnamed: 0,population,state,regional_division,state.1,article_title,revision_id,article_quality
0,5074296,Alabama,South_East South Central,Alabama,"Abbeville, Alabama",1171163550,C
1,5074296,Alabama,South_East South Central,Alabama,"Adamsville, Alabama",1177621427,C
2,5074296,Alabama,South_East South Central,Alabama,"Addison, Alabama",1168359898,C
3,5074296,Alabama,South_East South Central,Alabama,"Akron, Alabama",1165909508,GA
4,5074296,Alabama,South_East South Central,Alabama,"Alabaster, Alabama",1179139816,C
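
In the preview above, `Unnamed: 0` is the name pandas assigns to a first column with an empty header, which usually means a row index was written out by an earlier `to_csv` call, and `state.1` is how pandas renames a second column called `state`. The following is a minimal sketch, assuming both observations hold for the full file, of loading a tidier frame. It reads a small in-memory sample (rows copied from the preview) rather than the real CSV, and the names `sample_csv` and `clean` are illustrative only.

```python
import io
import pandas as pd

# Sample rows copied from the preview; this stands in for the real CSV file.
# The header is assumed to contain an empty first name and two 'state' columns.
sample_csv = io.StringIO(
    ',population,state,regional_division,state,article_title,revision_id,article_quality\n'
    '0,5074296,Alabama,South_East South Central,Alabama,"Abbeville, Alabama",1171163550,C\n'
    '1,5074296,Alabama,South_East South Central,Alabama,"Adamsville, Alabama",1177621427,C\n'
    '3,5074296,Alabama,South_East South Central,Alabama,"Akron, Alabama",1165909508,GA\n'
)

# index_col=0 uses the first (unnamed) column as the row index
clean = pd.read_csv(sample_csv, index_col=0)

# Drop 'state.1' only if it really is a copy of 'state'
if 'state.1' in clean.columns and (clean['state'] == clean['state.1']).all():
    clean = clean.drop(columns='state.1')

print(clean.columns.tolist())
```

The later cells would run unchanged on such a frame, except that `'state.1'` and `'Unnamed: 0'` would no longer appear among the columns.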


## Analysis

### 1. Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order)

This cell groups the rows by state, counts the articles in each state, and divides that count by the state's population, taken as the mean of the per-row `population` column. The states are then sorted by articles per capita in descending order, and the first 10 are displayed.

In [43]:
# Group by state and calculate coverage per capita
statewise_count = data.groupby('state')['article_title'].count()
statewise_pop = data.groupby('state')['population'].mean()
statewise_coverage_per_capita = statewise_count / statewise_pop

# Sort by coverage per capita in descending order
top_state_coverage = statewise_coverage_per_capita.sort_values(ascending=False)

# Create a DataFrame and rename columns
top_state_coverage = top_state_coverage.to_frame(name='Total Articles per Capita')

# Display the top 10 states by coverage per capita
top_10_state_coverage = top_state_coverage.head(10)

print("Top 10 US States by Coverage")
print(top_10_state_coverage)

Top 10 US States by Coverage
              Total Articles per Capita
state                                  
Vermont                        0.000508
North Dakota                   0.000457
Maine                          0.000349
South Dakota                   0.000342
Iowa                           0.000326
Alaska                         0.000203
Pennsylvania                   0.000197
Alabama                        0.000182
Michigan                       0.000177
Wyoming                        0.000170
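
The per-state mean of `population` equals the state population only if all rows of a state carry the same value. Here is a minimal sketch, on made-up toy data, of verifying that and of computing the same ratio in one pass with named aggregation. The frame `toy` and the output column names are hypothetical; only the input column names follow the CSV.

```python
import pandas as pd

# Toy data with made-up numbers; column names follow the notebook's CSV.
toy = pd.DataFrame({
    'state': ['A', 'A', 'A', 'B', 'B'],
    'population': [1000, 1000, 1000, 4000, 4000],
    'article_title': ['a1', 'a2', 'a3', 'b1', 'b2'],
})

# Each state should have exactly one distinct population value
pop_values_per_state = toy.groupby('state')['population'].nunique()
assert (pop_values_per_state == 1).all(), "population varies within a state"

# Named aggregation: article count and population in a single groupby
per_state = toy.groupby('state').agg(
    articles=('article_title', 'count'),
    population=('population', 'first'),
)
per_state['articles_per_capita'] = per_state['articles'] / per_state['population']

print(per_state.sort_values('articles_per_capita', ascending=False))
```

If the uniqueness check fails on real data, the mean would silently blend different population figures, so it is worth running once before trusting the ranking.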


### 2. Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order)

This cell sorts the Series `statewise_coverage_per_capita` in ascending order with `sort_values`, takes the first 10 entries with `head(10)` (the states with the fewest articles per capita), converts them to a DataFrame whose column is named 'Total Articles per Capita', and prints the result.

In [44]:
# Sort the Series in ascending order
statewise_coverage_per_capita = statewise_coverage_per_capita.sort_values()

# Take the 10 lowest values and name the resulting column
bottom_state_coverage = statewise_coverage_per_capita.head(10).to_frame(name='Total Articles per Capita')

print("Bottom 10 US States by Coverage")
print(bottom_state_coverage)


Bottom 10 US States by Coverage
                Total Articles per Capita
state                                    
North Carolina                   0.000005
Nevada                           0.000006
California                       0.000012
Arizona                          0.000012
Florida                          0.000019
Oklahoma                         0.000019
Kansas                           0.000021
Maryland                         0.000025
Virginia                         0.000031
Wisconsin                        0.000033
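
As an alternative to sorting and then calling `head(10)`, pandas offers `Series.nsmallest` and `Series.nlargest`, which return the extreme values already ordered and leave the original Series untouched. This is a minimal sketch on a made-up Series; `coverage` and its values are hypothetical stand-ins for `statewise_coverage_per_capita`.

```python
import pandas as pd

# Made-up per-capita values for four fictional states
coverage = pd.Series(
    {'A': 0.003, 'B': 0.0005, 'C': 0.001, 'D': 0.002},
    name='Total Articles per Capita',
)
coverage.index.name = 'state'

# Lowest two values, in ascending order
bottom_2 = coverage.nsmallest(2).to_frame()

# Highest two values, in descending order
top_2 = coverage.nlargest(2).to_frame()

print(bottom_2)
print(top_2)
```

Because the Series is named, `to_frame()` picks up the column label without a separate rename step.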


### 3. Top 10 US states by high quality: The 10 US states with the highest high-quality articles per capita (in descending order)

The code keeps only the high-quality articles (those rated FA or GA), divides each state's count of such articles by its population, and lists the 10 states with the highest ratios.

In [45]:
# Filter high-quality articles (FA and GA)
hq_articles = data[data['article_quality'].isin(['FA', 'GA'])]

# Group by state and calculate coverage per capita
hq_statewise_coverage_per_capita = hq_articles.groupby('state').size() / hq_articles.groupby('state')['population'].mean()

# Sort the coverage per capita in descending order
hq_top_state_coverage = hq_statewise_coverage_per_capita.sort_values(ascending=False).head(10)

print("Top 10 US States by High Quality")
print(hq_top_state_coverage)


Top 10 US States by High Quality
state
Vermont          0.000070
Wyoming          0.000067
South Dakota     0.000062
West Virginia    0.000060
Montana          0.000049
New Hampshire    0.000045
Pennsylvania     0.000044
Missouri         0.000043
Alaska           0.000042
New Jersey       0.000041
dtype: float64
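
Because `hq_articles` contains only rows rated FA or GA, a state with no such article has no group in the `groupby` and drops out of the ranking instead of scoring zero. Whether any state is affected depends on the data. The following is a minimal sketch, on made-up toy data, of one way to keep such states by reindexing the counts against the full list of states; all names and numbers here are hypothetical.

```python
import pandas as pd

# Toy data: state C has no FA or GA article
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B', 'C'],
    'population': [1000, 1000, 4000, 4000, 2000],
    'article_quality': ['GA', 'C', 'FA', 'GA', 'Stub'],
})

# Population per state, taken from the full frame so every state is present
state_pop = toy.groupby('state')['population'].first()

# Count FA/GA articles per state, filling in 0 for states with none
hq_counts = (
    toy[toy['article_quality'].isin(['FA', 'GA'])]
    .groupby('state')
    .size()
    .reindex(state_pop.index, fill_value=0)
)

hq_per_capita = (hq_counts / state_pop).sort_values(ascending=False)
print(hq_per_capita)
```

With this variant, states without high-quality articles appear at the bottom with a value of 0, which changes what a "bottom 10" list means.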


### 4. Bottom 10 US states by high quality: The 10 US states with the lowest high-quality articles per capita (in ascending order)

The code below sorts the Series `hq_statewise_coverage_per_capita` in ascending order, converts it to a DataFrame named `hq_bottom_state_coverage`, and displays the first 10 rows, which are the states with the fewest high-quality articles per capita. The column keeps the label 'Total Articles per Capita', but its values count only FA and GA articles.

In [46]:
# Sort the Series in place (ascending order)
hq_statewise_coverage_per_capita.sort_values(inplace=True)

# Create a DataFrame with renamed column
hq_bottom_state_coverage = hq_statewise_coverage_per_capita.to_frame(name='Total Articles per Capita')

print("Bottom 10 US States by High Quality")
print(hq_bottom_state_coverage.head(10))


Bottom 10 US States by High Quality
                Total Articles per Capita
state                                    
North Carolina                   0.000002
Nevada                           0.000003
Arizona                          0.000003
Virginia                         0.000004
California                       0.000004
Florida                          0.000005
New York                         0.000006
Maryland                         0.000007
Kansas                           0.000007
Oklahoma                         0.000008


### 5. Census divisions by total coverage: A rank-ordered list of US census divisions (in descending order) by total articles per capita

This cell first takes the per-state mean of the numeric columns, which yields each state's population, and attaches each state's regional division. It then counts the articles per state and merges the two tables. Summing the article counts and the populations within each division and dividing the former by the latter gives the articles per capita for each division. Finally, the divisions are sorted in descending order and ranked.

In [47]:
# Calculate mean of numeric columns per state
mean_state_data = data.groupby('state').mean(numeric_only=True)
mean_state_data.reset_index(inplace=True)

# Retrieve the regional division for each state (taken from the state's first row)
state_regions = data.drop_duplicates('state').set_index('state')['regional_division']
mean_state_data['regional_division'] = mean_state_data['state'].map(state_regions)

# Count the number of articles per state
article_count_per_state = data.groupby('state').count()
article_count_per_state.reset_index(inplace=True)
article_count_per_state.drop(['regional_division', 'population', 'revision_id', 'article_quality'], axis=1, inplace=True)

# Merge mean data with article count per state
merged_state_data = pd.merge(mean_state_data, article_count_per_state, on='state', how='inner')

# Calculate region-wise article coverage per capita
region_population = merged_state_data.groupby('regional_division')['population'].sum()
region_article_count = merged_state_data.groupby('regional_division')['article_title'].sum()
regionwise_coverage_per_capita = region_article_count / region_population

# Sort the coverage values in descending order
regionwise_coverage_per_capita.sort_values(ascending=False, inplace=True)

# Create a DataFrame for region-wise coverage per capita
regionwise_coverage = regionwise_coverage_per_capita.to_frame(name='Total_Articles_per_Capita')
regionwise_coverage['Rank'] = range(1, len(regionwise_coverage) + 1)

print("Census Divisions by Total Coverage")
print(regionwise_coverage)

Census Divisions by Total Coverage
                            Total_Articles_per_Capita  Rank
regional_division                                          
Midwest_West North Central                   0.000181     1
Northeast_New England                        0.000125     2
South_East South Central                     0.000102     3
Midwest_East North Central                   0.000101     4
Northeast_Middle Atlantic                    0.000090     5
South_West South Central                     0.000051     6
West_Mountain                                0.000047     7
South_South Atlantic                         0.000030     8
West_Pacific                                 0.000024     9
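
The same division-level ratio can be reached without the intermediate mean and merge steps. Below is a minimal sketch on made-up toy data, assuming each state's rows repeat one population value and one regional division: populations come from one row per state, and article counts come from the group sizes. The frame `toy` and the division labels `R1` and `R2` are hypothetical.

```python
import pandas as pd

# Toy data: states A and B belong to division R1, state C to R2
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'C', 'C', 'C'],
    'regional_division': ['R1', 'R1', 'R1', 'R2', 'R2', 'R2'],
    'population': [1000, 1000, 3000, 2000, 2000, 2000],
    'article_title': ['a1', 'a2', 'b1', 'c1', 'c2', 'c3'],
})

# One row per state, so each state's population is counted exactly once
states = toy.drop_duplicates('state')
region_population = states.groupby('regional_division')['population'].sum()

# Every row is one article, so group sizes are article counts
region_articles = toy.groupby('regional_division').size()

# Articles per capita per division, sorted and ranked
coverage = (region_articles / region_population).sort_values(ascending=False)
ranked = coverage.to_frame(name='Total_Articles_per_Capita')
ranked['Rank'] = range(1, len(ranked) + 1)

print(ranked)
```

Summing populations from deduplicated state rows avoids counting a state's population once per article, which is the pitfall the mean-then-merge approach also guards against.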


### 6. Census divisions by high-quality coverage: A rank-ordered list of US census divisions (in descending order) by high-quality articles per capita

This cell repeats the division-level calculation using only the high-quality articles in `hq_articles`. It takes the per-state means and each state's regional division, counts the high-quality articles per state, and merges the two tables. It then sums the article counts and the populations within each division, divides to get high-quality articles per capita, and presents the sorted result with ranks.

In [48]:
# Calculate mean values for each state in the hq_articles DataFrame
state_mean_quality = hq_articles.groupby('state').mean(numeric_only=True).reset_index()

# Extract the regional division for each state
state_regions = hq_articles.groupby('state')['regional_division'].first().reset_index()
state_mean_quality['regional_division'] = state_regions['regional_division']

# Count the number of articles for each state
state_article_count = hq_articles.groupby('state').count().reset_index()

# Drop unnecessary columns
state_article_count.drop(['regional_division', 'population', 'revision_id', 'article_quality'], axis=1, inplace=True)

# Merge the mean quality and article count DataFrames
state_quality_count = pd.merge(state_mean_quality, state_article_count, on='state', how='inner')

# Calculate the high-quality articles per capita for each regional division
regionwise_article_count = state_quality_count.groupby('regional_division')['article_title'].sum()
regionwise_population = state_quality_count.groupby('regional_division')['population'].sum()
regionwise_coverage_per_capita = regionwise_article_count / regionwise_population

# Sort the regional divisions by coverage per capita
regionwise_coverage_per_capita = regionwise_coverage_per_capita.sort_values(ascending=False)

# Create a DataFrame to store regional division-wise coverage and rank
regionwise_coverage = regionwise_coverage_per_capita.to_frame(name='Total Articles per Capita').reset_index()
regionwise_coverage['Rank'] = range(1, len(regionwise_coverage) + 1)

print("Census Divisions by High Quality Coverage")
print(regionwise_coverage)

Census Divisions by High Quality Coverage
            regional_division  Total Articles per Capita  Rank
0  Midwest_West North Central                   0.000033     1
1   Northeast_Middle Atlantic                   0.000025     2
2       Northeast_New England                   0.000020     3
3    South_East South Central                   0.000019     4
4  Midwest_East North Central                   0.000015     5
5    South_West South Central                   0.000015     6
6               West_Mountain                   0.000014     7
7                West_Pacific                   0.000009     8
8        South_South Atlantic                   0.000008     9
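
In the cell above, division populations are summed from `hq_articles`, so a state with no FA or GA article would contribute nothing to its division's population and the division's ratio would be overstated. Whether this affects the figures shown depends on the data. Here is a minimal sketch, on made-up toy data, that takes the denominator from the full frame instead; `toy`, `R1`, and `R2` are hypothetical.

```python
import pandas as pd

# Toy data: state B (division R1) has no FA or GA article
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'C', 'C', 'C'],
    'regional_division': ['R1', 'R1', 'R1', 'R2', 'R2', 'R2'],
    'population': [1000, 1000, 3000, 2000, 2000, 2000],
    'article_quality': ['GA', 'C', 'C', 'FA', 'GA', 'Stub'],
})

is_hq = toy['article_quality'].isin(['FA', 'GA'])

# Denominator: population of every state in the division, HQ articles or not
region_population = (
    toy.drop_duplicates('state')
    .groupby('regional_division')['population']
    .sum()
)

# Numerator: FA/GA article counts, with 0 for divisions that have none
region_hq = (
    toy[is_hq]
    .groupby('regional_division')
    .size()
    .reindex(region_population.index, fill_value=0)
)

hq_coverage = (region_hq / region_population).sort_values(ascending=False)
print(hq_coverage)
```

In this toy example, R1 scores 1/4000 rather than the 1/1000 it would get if state B's population were left out of the denominator.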
