# **Basic State Population Predictor**
### Author: JJ McCauley
Serving as a basic introductory exercise, this program aims to predict the future population of the US by scaping past data, using a simple linear regression model to make a prediction, then visually modeling the data.

In [15]:
# Storing online table
import pandas as pd
import numpy as np
import ssl 
# Linear Regression Model imports
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

## Scraping Population Data
First, we will be using the pandas library to scrape the population data off of the web.

In [55]:
# Cleaning Functions (defined as function for ease)
def convert_percentage_to_float(x):
    if isinstance(x, str) and x.endswith('%'):
        return float(x.rstrip('%'))
    return x

#Retrive the data from the website
url = 'https://www.census.gov/data/tables/time-series/dec/popchange-data-text.html'
ssl._create_default_https_context = ssl._create_unverified_context
scraper = pd.read_html(url)
#Saving the scraper as a pandas dataframe
df = scraper[0]
#Cleaning the data
df = df.applymap(lambda x: x.replace(' ', '_') if isinstance(x, str) else x) #replace spaces with underscores
df = df.applymap(convert_percentage_to_float) #replace spaces with underscores
df['State or Region'] = df['State or Region'].str.upper()
print("---Population Data Loaded---")
print(df)

---Population Data Loaded---
         State or Region    2020 Census    2010 Census    2000 Census  \
0          UNITED_STATES  United_States  United_States  United_States   
1    RESIDENT_POPULATION      331449281      308745538      281421906   
2         PERCENT_CHANGE            7.4            9.7           13.2   
3              NORTHEAST      Northeast      Northeast      Northeast   
4    RESIDENT_POPULATION       57609148       55317240       53594378   
..                   ...            ...            ...            ...   
166  RESIDENT_POPULATION         576851         563626         493782   
167       PERCENT_CHANGE            2.3           14.1            8.9   
168          PUERTO_RICO    Puerto_Rico    Puerto_Rico    Puerto_Rico   
169  RESIDENT_POPULATION        3285874        3725789        3808610   
170       PERCENT_CHANGE          -11.8           -2.2            8.1   

       1990 Census    1980 Census    1970 Census    1960 Census  \
0    United_States  United_

  df = df.applymap(lambda x: x.replace(' ', '_') if isinstance(x, str) else x) #replace spaces with underscores
  df = df.applymap(convert_percentage_to_float) #replace spaces with underscores


## Creating the Linear Regression Model & Visualizing
Create a Linear Regression Model using the sklearn, then visualize using pandas library. 
The features will be:
- resident population 
- percent change.


In [53]:
#Take in the dataframe of the state's population and population change as arguments
def Model_and_Visualize(df):
    x = df[['PERCENT_CHANGE']].values
    y = df[['RESIDENT_POPULATION']].values
    model = LinearRegression()
    model.fit(x, y)
    y_pred = model.predict(x)
    print("Prediction: " + y_pred)
    

## Receiving State Input From the User
Lastly, we will ask the user for a state and find it in the pandas dataframe. We will then call our relevant functions to visualize and predict the next resident population.

In [54]:
state_to_find = input("Enter the state/region to calculate: ")
state_to_find = state_to_find.upper()
#Looping until the user would like to quit
while(state_to_find != 'Q'): 
    row_indicies, col_indicies = np.where(df.values == state_to_find) #Finding the element
    if len(row_indicies) > 0: #If the element was found
        row_index = row_indicies[0]
        state_df = df.iloc[row_index + 1: row_index + 3].copy() #Save the data in a df as a copy
        print(state_df)
        #Run the Linear Regression Model
        Model_and_Visualize(state_df)
    else:
        print("Invalid Input")
    
    state_to_find = input("Enter the state/region to calculate (Q to quit): ")
    state_to_find = state_to_find.upper()
       

        State or Region 2020 Census 2010 Census 2000 Census 1990 Census  \
76  RESIDENT_POPULATION     6177224     5773552     5296486     4781468   
77       PERCENT_CHANGE         7.0         9.0        10.8        13.4   

   1980 Census 1970 Census 1960 Census 1950 Census 1940 Census 1930 Census  \
76     4216975     3922399     3100689     2343001     1821244     1631526   
77         7.5        26.5        32.3        28.6        11.6        12.5   

   1920 Census 1910 Census  
76     1449661     1295346  
77        11.9         9.0  


KeyError: "None of [Index(['PERCENT_CHANGE'], dtype='object')] are in the [columns]"