# Predicting the opening of Costco stores based on hearing aid accessibility

We use logistic regression, a statistical method for predicting binary outcomes from data. In this case the two outcomes are "yes, there is a Costco here" and "no, there is no Costco here".

These two categories are encoded as 1 and 0, and the model estimates the probability that a sample belongs to each of them.

Because logistic regression predicts binary outcomes, there are only two possible results for each location. For this analysis, we look for the features in our data set that appear to predict the presence of a Costco and use them to build the model. Several variables have to be taken into consideration, such as a Costco's location and the area's demographics, education, and income. These are assessed to arrive at one of two answers: yes, there is a Costco in this location, or no, there is not. In other words, the logistic regression model analyzes the available data and, when presented with a new sample, mathematically determines its probability of belonging to a class. With the testing set, we try to predict which locations would or would not have a Costco. If the probability is above a certain cutoff point, say 70%, the sample is assigned to that class. If the probability is below the cutoff point, the sample is assigned to the other class.
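To make the cutoff rule concrete, here is a minimal sketch in plain Python. The function names `sigmoid` and `assign_class`, the example scores, and the 0.70 cutoff are illustrative assumptions and are not taken from the analysis below.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def assign_class(probability: float, cutoff: float = 0.70) -> int:
    """Return 1 ("Costco here") if the probability reaches the cutoff, else 0."""
    return 1 if probability >= cutoff else 0


# Hypothetical linear scores for three ZIP codes (made-up numbers).
scores = [-2.0, 0.5, 1.5]
probabilities = [sigmoid(z) for z in scores]
labels = [assign_class(p) for p in probabilities]

for z, p, label in zip(scores, probabilities, labels):
    print(f"score={z:+.1f}  probability={p:.3f}  class={label}")
```

Note that scikit-learn's `predict()` effectively uses a 0.5 cutoff for binary problems, so a custom cutoff such as 70% would have to be applied to the output of `predict_proba()` instead.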

Let's summarize the steps we took to use a logistic regression model:

1. Create a model with `LogisticRegression()`.
2. Train the model with `model.fit()`.
3. Make predictions with `model.predict()`.
4. Validate the model with `accuracy_score()`.
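The four steps can be sketched end to end as follows. This is only an illustration on synthetic data generated with `make_classification`; the variable names and the data are assumptions and have nothing to do with the Costco data set used below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 samples, 5 numeric features.
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

model = LogisticRegression(max_iter=200, random_state=1)  # 1. create the model
model.fit(X_tr, y_tr)                                     # 2. train it
predictions = model.predict(X_te)                         # 3. predict on unseen data
score = accuracy_score(y_te, predictions)                 # 4. validate
print(f"Test accuracy: {score:.3f}")
```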



In [1]:
# Import libraries
import os
from collections import Counter
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
# Start a SparkSession
import findspark
findspark.init()

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CloudETL").config("spark.driver.extraClassPath","/content/postgresql-42.2.9.jar").getOrCreate()

22/11/03 20:44:41 WARN Utils: Your hostname, Claudias-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 192.168.0.3 instead (on interface en0)
22/11/03 20:44:41 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/11/03 20:44:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/11/03 20:44:44 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [4]:
# Read in data from S3 Buckets - the "Database" system we are using
from pyspark import SparkFiles

# Cleaned ACS Data
url="https://gwufphearingaids.s3.amazonaws.com/ml_data.csv" 
spark.sparkContext.addFile(url)
ml_data_df = spark.read.csv(SparkFiles.get("ml_data.csv"), sep=",", header=True, inferSchema=True)

# Show DataFrame
#ml_data_df.show()
ml_data_df = ml_data_df.toPandas()

                                                                                

22/11/03 20:45:02 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
22/11/03 20:45:02 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , ZIP, CITY_x, STATE_x, 2019_Total_Population:_20_to_24_Years, 2019_Total_Population:_25_to_29_Years, 2019_Total_Population:_30_to_34_Years, 2019_Total_Population:_35_to_39_Years, 2019_Total_Population:_40_to_44_Years, 2019_Total_Population:_45_to_49_Years, 2019_Total_Population:_50_to_54_Years, 2019_Total_Population:_55_to_59_Years, 2019_Total_Population:_60_to_64_Years, 2019_Total_Population:_65_to_69_Years, 2019_Population_Age_25+_by_Educational_Attainment_Base, 2019_Pop_Age_25+:_No_Schooling, 2019_Pop_Age_25+:_9th-12th_(No_Diploma), 2019_Pop_Age_25+:_High_School_Diploma, 2019_Pop_Age_25+:_GED/Alternative_Credential, 2019_Pop_Age_25+:_Some_College/No_Degree, 2019_Pop_Age_25+:_Associate_Degree, 2019_Pop_Age_25+:_Bac

                                                                                

In [5]:
columns = ['ZIP', 
           '2019_Total_Population:_20_to_24_Years', '2019_Total_Population:_25_to_29_Years', 
           '2019_Total_Population:_30_to_34_Years', '2019_Total_Population:_35_to_39_Years', 
           '2019_Total_Population:_40_to_44_Years', '2019_Total_Population:_45_to_49_Years', 
           '2019_Total_Population:_50_to_54_Years', '2019_Total_Population:_55_to_59_Years', 
           '2019_Total_Population:_60_to_64_Years', '2019_Total_Population:_65_to_69_Years',
           '2019_Population_Age_25+_by_Educational_Attainment_Base',
           '2019_Pop_Age_25+:_No_Schooling', '2019_Pop_Age_25+:_9th-12th_(No_Diploma)', 
           '2019_Pop_Age_25+:_High_School_Diploma', '2019_Pop_Age_25+:_GED/Alternative_Credential', 
           '2019_Pop_Age_25+:_Some_College/No_Degree', '2019_Pop_Age_25+:_Associate_Degree',
           '2019_Pop_Age_25+:_Bachelors_Degree', '2019_Pop_Age_25+:_Graduate_Degree',
           '2019_Pop_Age_25+:_Professional_School_Degree', '2019_Pop_Age_25+:_Doctorate_Degree',
           '2019_Median_HH_Income:_HHr_Age_15-24', '2019_Median_HH_Income:_HHr_Age_25-44', 
           '2019_Median_HH_Income:_HHr_Age_45-64', '2019_Median_HH_Income:_HHr_Age_65+', 
           '2021_Median_Household_Income', '2026_Median_Household_Income', 
           '2021_Average_Household_Income', '2026_Average_Household_Income', 
           '2021_2026_Population:_Compound_Annual_Growth_Rate'
           ]

target = ["COSTCO_HEARING_CENTER"]

In [6]:
ml_data_df.head()

Unnamed: 0,_c0,ZIP,CITY_x,STATE_x,2019_Total_Population:_20_to_24_Years,2019_Total_Population:_25_to_29_Years,2019_Total_Population:_30_to_34_Years,2019_Total_Population:_35_to_39_Years,2019_Total_Population:_40_to_44_Years,2019_Total_Population:_45_to_49_Years,...,2019_Median_HH_Income:_HHr_Age_15-24,2019_Median_HH_Income:_HHr_Age_25-44,2019_Median_HH_Income:_HHr_Age_45-64,2019_Median_HH_Income:_HHr_Age_65+,2021_Median_Household_Income,2026_Median_Household_Income,2021_Average_Household_Income,2026_Average_Household_Income,2021_2026_Population:_Compound_Annual_Growth_Rate,COSTCO_HEARING_CENTER
0,0,1001,Agawam,MA,650.0,1027.0,1153.0,773.0,1217.0,1038.0,...,27747.0,78863.0,81899.0,45841.0,68392.0,76174.0,87293.0,98284.0,0.08,
1,1,1002,Amherst,MA,11249.0,2023.0,1344.0,1017.0,951.0,1012.0,...,26373.0,55775.0,98699.0,77235.0,63315.0,71008.0,93933.0,104369.0,0.27,
2,2,1003,Amherst,MA,2734.0,31.0,0.0,7.0,0.0,0.0,...,5000.0,0.0,0.0,87500.0,7500.0,7500.0,12124.0,13505.0,0.0,
3,3,1005,Barre,MA,414.0,213.0,310.0,121.0,149.0,475.0,...,5000.0,79869.0,84112.0,51184.0,77915.0,88169.0,107888.0,123463.0,0.52,
4,4,1007,Belchertown,MA,1252.0,597.0,755.0,987.0,823.0,1113.0,...,28089.0,100454.0,112125.0,54294.0,97576.0,104725.0,115051.0,128399.0,0.53,


In [7]:
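# Drop the leftover index column (_c0), CITY_x and STATE_x
ml_data_df = ml_data_df.drop(ml_data_df.columns[[0, 2, 3]], axis=1)

# Drop the columns where all values are null.
# NOTE: the result is stored in `ml_datadf` (a different name), so
# `ml_data_df` itself is not changed by this step or the next one.
ml_datadf = ml_data_df.dropna(axis='columns', how='all')

# Drop the rows that contain any null value (again stored in `ml_datadf`)
ml_datadf = ml_data_df.dropna()

ml_data_df.head()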

Unnamed: 0,ZIP,2019_Total_Population:_20_to_24_Years,2019_Total_Population:_25_to_29_Years,2019_Total_Population:_30_to_34_Years,2019_Total_Population:_35_to_39_Years,2019_Total_Population:_40_to_44_Years,2019_Total_Population:_45_to_49_Years,2019_Total_Population:_50_to_54_Years,2019_Total_Population:_55_to_59_Years,2019_Total_Population:_60_to_64_Years,...,2019_Median_HH_Income:_HHr_Age_15-24,2019_Median_HH_Income:_HHr_Age_25-44,2019_Median_HH_Income:_HHr_Age_45-64,2019_Median_HH_Income:_HHr_Age_65+,2021_Median_Household_Income,2026_Median_Household_Income,2021_Average_Household_Income,2026_Average_Household_Income,2021_2026_Population:_Compound_Annual_Growth_Rate,COSTCO_HEARING_CENTER
0,1001,650.0,1027.0,1153.0,773.0,1217.0,1038.0,1331.0,1259.0,1175.0,...,27747.0,78863.0,81899.0,45841.0,68392.0,76174.0,87293.0,98284.0,0.08,
1,1002,11249.0,2023.0,1344.0,1017.0,951.0,1012.0,1273.0,1241.0,1212.0,...,26373.0,55775.0,98699.0,77235.0,63315.0,71008.0,93933.0,104369.0,0.27,
2,1003,2734.0,31.0,0.0,7.0,0.0,0.0,0.0,4.0,0.0,...,5000.0,0.0,0.0,87500.0,7500.0,7500.0,12124.0,13505.0,0.0,
3,1005,414.0,213.0,310.0,121.0,149.0,475.0,368.0,749.0,493.0,...,5000.0,79869.0,84112.0,51184.0,77915.0,88169.0,107888.0,123463.0,0.52,
4,1007,1252.0,597.0,755.0,987.0,823.0,1113.0,1416.0,969.0,1236.0,...,28089.0,100454.0,112125.0,54294.0,97576.0,104725.0,115051.0,128399.0,0.53,


In [8]:
#ml_data_df.loc[ml_data_df['COSTCO_HEARING_CENTER'] == 'Yes']

In [9]:
ml_data_df.COSTCO_HEARING_CENTER.unique()

array([None, 'Yes', 'No'], dtype=object)

In [10]:
ml_data_df.tail()

Unnamed: 0,ZIP,2019_Total_Population:_20_to_24_Years,2019_Total_Population:_25_to_29_Years,2019_Total_Population:_30_to_34_Years,2019_Total_Population:_35_to_39_Years,2019_Total_Population:_40_to_44_Years,2019_Total_Population:_45_to_49_Years,2019_Total_Population:_50_to_54_Years,2019_Total_Population:_55_to_59_Years,2019_Total_Population:_60_to_64_Years,...,2019_Median_HH_Income:_HHr_Age_15-24,2019_Median_HH_Income:_HHr_Age_25-44,2019_Median_HH_Income:_HHr_Age_45-64,2019_Median_HH_Income:_HHr_Age_65+,2021_Median_Household_Income,2026_Median_Household_Income,2021_Average_Household_Income,2026_Average_Household_Income,2021_2026_Population:_Compound_Annual_Growth_Rate,COSTCO_HEARING_CENTER
32114,725,,,,,,,,,,...,,,,,,,,,,Yes
32115,924,,,,,,,,,,...,,,,,,,,,,Yes
32116,959,,,,,,,,,,...,,,,,,,,,,Yes
32117,926,,,,,,,,,,...,,,,,,,,,,No
32118,961,,,,,,,,,,...,,,,,,,,,,Yes


In [11]:
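# Note: the missing labels are None objects, not the string 'None',
# so this replacement matches nothing (see the unique() output in cell 13).
ml_data_df['COSTCO_HEARING_CENTER'].replace('None', np.nan, inplace=True)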

In [12]:
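# Note: the result is not assigned back, so the column is unchanged here;
# the labels are actually encoded in cell 15.
ml_data_df['COSTCO_HEARING_CENTER'].replace(np.nan, '0')
ml_data_df.head()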

Unnamed: 0,ZIP,2019_Total_Population:_20_to_24_Years,2019_Total_Population:_25_to_29_Years,2019_Total_Population:_30_to_34_Years,2019_Total_Population:_35_to_39_Years,2019_Total_Population:_40_to_44_Years,2019_Total_Population:_45_to_49_Years,2019_Total_Population:_50_to_54_Years,2019_Total_Population:_55_to_59_Years,2019_Total_Population:_60_to_64_Years,...,2019_Median_HH_Income:_HHr_Age_15-24,2019_Median_HH_Income:_HHr_Age_25-44,2019_Median_HH_Income:_HHr_Age_45-64,2019_Median_HH_Income:_HHr_Age_65+,2021_Median_Household_Income,2026_Median_Household_Income,2021_Average_Household_Income,2026_Average_Household_Income,2021_2026_Population:_Compound_Annual_Growth_Rate,COSTCO_HEARING_CENTER
0,1001,650.0,1027.0,1153.0,773.0,1217.0,1038.0,1331.0,1259.0,1175.0,...,27747.0,78863.0,81899.0,45841.0,68392.0,76174.0,87293.0,98284.0,0.08,
1,1002,11249.0,2023.0,1344.0,1017.0,951.0,1012.0,1273.0,1241.0,1212.0,...,26373.0,55775.0,98699.0,77235.0,63315.0,71008.0,93933.0,104369.0,0.27,
2,1003,2734.0,31.0,0.0,7.0,0.0,0.0,0.0,4.0,0.0,...,5000.0,0.0,0.0,87500.0,7500.0,7500.0,12124.0,13505.0,0.0,
3,1005,414.0,213.0,310.0,121.0,149.0,475.0,368.0,749.0,493.0,...,5000.0,79869.0,84112.0,51184.0,77915.0,88169.0,107888.0,123463.0,0.52,
4,1007,1252.0,597.0,755.0,987.0,823.0,1113.0,1416.0,969.0,1236.0,...,28089.0,100454.0,112125.0,54294.0,97576.0,104725.0,115051.0,128399.0,0.53,


In [13]:
ml_data_df.COSTCO_HEARING_CENTER.unique()

array([None, 'Yes', 'No'], dtype=object)
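The output above still contains `None` after the two `replace()` calls. A small sketch on a toy Series suggests why, and shows one way to encode such a column; the names `labels`, `after_replace` and `encoded` are hypothetical and the toy values simply mirror the three values seen above.

```python
import numpy as np
import pandas as pd

# Toy column mirroring the three values seen above (None, 'Yes', 'No').
labels = pd.Series([None, "Yes", "No"], dtype=object)

# Replacing the *string* 'None' finds nothing: the missing entry is the
# None object, not text, so one value is still missing afterwards.
after_replace = labels.replace("None", np.nan)
print(after_replace.isna().sum())

# Without assignment (or inplace=True), replace() returns a new Series
# and leaves `labels` untouched.
labels.replace("No", "0")
print(labels.tolist())

# Comparing against 'Yes' encodes everything else (None and 'No') as 0.
encoded = (labels == "Yes").astype(int)
print(encoded.tolist())
```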

In [14]:
ml_data_df.tail()

Unnamed: 0,ZIP,2019_Total_Population:_20_to_24_Years,2019_Total_Population:_25_to_29_Years,2019_Total_Population:_30_to_34_Years,2019_Total_Population:_35_to_39_Years,2019_Total_Population:_40_to_44_Years,2019_Total_Population:_45_to_49_Years,2019_Total_Population:_50_to_54_Years,2019_Total_Population:_55_to_59_Years,2019_Total_Population:_60_to_64_Years,...,2019_Median_HH_Income:_HHr_Age_15-24,2019_Median_HH_Income:_HHr_Age_25-44,2019_Median_HH_Income:_HHr_Age_45-64,2019_Median_HH_Income:_HHr_Age_65+,2021_Median_Household_Income,2026_Median_Household_Income,2021_Average_Household_Income,2026_Average_Household_Income,2021_2026_Population:_Compound_Annual_Growth_Rate,COSTCO_HEARING_CENTER
32114,725,,,,,,,,,,...,,,,,,,,,,Yes
32115,924,,,,,,,,,,...,,,,,,,,,,Yes
32116,959,,,,,,,,,,...,,,,,,,,,,Yes
32117,926,,,,,,,,,,...,,,,,,,,,,No
32118,961,,,,,,,,,,...,,,,,,,,,,Yes


In [15]:
# Encode the target: 1 where a hearing center exists ('Yes'), 0 otherwise ('No' or missing)
ml_data_df['COSTCO_HEARING_CENTER'] = np.where(ml_data_df['COSTCO_HEARING_CENTER'] == 'Yes', 1, 0)

In [16]:
ml_data_df.COSTCO_HEARING_CENTER.unique()

array([0, 1])

In [17]:
ml_data_df.head()

Unnamed: 0,ZIP,2019_Total_Population:_20_to_24_Years,2019_Total_Population:_25_to_29_Years,2019_Total_Population:_30_to_34_Years,2019_Total_Population:_35_to_39_Years,2019_Total_Population:_40_to_44_Years,2019_Total_Population:_45_to_49_Years,2019_Total_Population:_50_to_54_Years,2019_Total_Population:_55_to_59_Years,2019_Total_Population:_60_to_64_Years,...,2019_Median_HH_Income:_HHr_Age_15-24,2019_Median_HH_Income:_HHr_Age_25-44,2019_Median_HH_Income:_HHr_Age_45-64,2019_Median_HH_Income:_HHr_Age_65+,2021_Median_Household_Income,2026_Median_Household_Income,2021_Average_Household_Income,2026_Average_Household_Income,2021_2026_Population:_Compound_Annual_Growth_Rate,COSTCO_HEARING_CENTER
0,1001,650.0,1027.0,1153.0,773.0,1217.0,1038.0,1331.0,1259.0,1175.0,...,27747.0,78863.0,81899.0,45841.0,68392.0,76174.0,87293.0,98284.0,0.08,0
1,1002,11249.0,2023.0,1344.0,1017.0,951.0,1012.0,1273.0,1241.0,1212.0,...,26373.0,55775.0,98699.0,77235.0,63315.0,71008.0,93933.0,104369.0,0.27,0
2,1003,2734.0,31.0,0.0,7.0,0.0,0.0,0.0,4.0,0.0,...,5000.0,0.0,0.0,87500.0,7500.0,7500.0,12124.0,13505.0,0.0,0
3,1005,414.0,213.0,310.0,121.0,149.0,475.0,368.0,749.0,493.0,...,5000.0,79869.0,84112.0,51184.0,77915.0,88169.0,107888.0,123463.0,0.52,0
4,1007,1252.0,597.0,755.0,987.0,823.0,1113.0,1416.0,969.0,1236.0,...,28089.0,100454.0,112125.0,54294.0,97576.0,104725.0,115051.0,128399.0,0.53,0


In [18]:
ml_data_df.tail()

Unnamed: 0,ZIP,2019_Total_Population:_20_to_24_Years,2019_Total_Population:_25_to_29_Years,2019_Total_Population:_30_to_34_Years,2019_Total_Population:_35_to_39_Years,2019_Total_Population:_40_to_44_Years,2019_Total_Population:_45_to_49_Years,2019_Total_Population:_50_to_54_Years,2019_Total_Population:_55_to_59_Years,2019_Total_Population:_60_to_64_Years,...,2019_Median_HH_Income:_HHr_Age_15-24,2019_Median_HH_Income:_HHr_Age_25-44,2019_Median_HH_Income:_HHr_Age_45-64,2019_Median_HH_Income:_HHr_Age_65+,2021_Median_Household_Income,2026_Median_Household_Income,2021_Average_Household_Income,2026_Average_Household_Income,2021_2026_Population:_Compound_Annual_Growth_Rate,COSTCO_HEARING_CENTER
32114,725,,,,,,,,,,...,,,,,,,,,,1
32115,924,,,,,,,,,,...,,,,,,,,,,1
32116,959,,,,,,,,,,...,,,,,,,,,,1
32117,926,,,,,,,,,,...,,,,,,,,,,0
32118,961,,,,,,,,,,...,,,,,,,,,,1


In [19]:
ml_data_df.replace(np.nan, 0, inplace=True)

In [20]:
# Check which columns are not int64/float64
ml_data_df.select_dtypes(exclude=['int64', 'float64']).dtypes

ZIP    int32
dtype: object

# Separate the Features (X) from the Target (y)

In [21]:
y = ml_data_df["COSTCO_HEARING_CENTER"]
X = ml_data_df.drop(columns="COSTCO_HEARING_CENTER")

# Split our data into training and testing

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    random_state=1, 
                                                    stratify=y)
X_train.shape

(24089, 31)
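The `stratify=y` argument keeps the share of each class roughly the same in the training and testing sets, which matters when one class is rare. The sketch below illustrates this on made-up labels; the 980/20 class split and the `*_toy` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels (assumption): 980 zeros and 20 ones.
y_toy = np.array([0] * 980 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

# Same call pattern as above: default 75/25 split, stratified on the labels.
X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, random_state=1, stratify=y_toy
)

print("overall positive rate:", y_toy.mean())
print("train positive rate:  ", y_a.mean())
print("test positive rate:   ", y_b.mean())
```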

# Create a Logistic Regression Model

In [23]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver='lbfgs',
                                max_iter=200,
                                random_state=1)


# Fit (train) our model using the training data

In [24]:
# Train the data
classifier.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
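The warning above suggests either raising `max_iter` or scaling the data. One possible way to follow the second suggestion is to wrap a scaler and the classifier in a pipeline. The sketch below uses synthetic data and the hypothetical name `scaled_classifier`; it was not run on the Costco data, so whether it removes the warning there is untested.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); one feature is blown up to mimic
# columns on very different scales, such as incomes next to growth rates.
X_demo, y_demo = make_classification(n_samples=1000, n_features=6, random_state=1)
X_demo[:, 0] *= 100_000

# The scaler is fitted inside the pipeline, so it only ever sees the data
# passed to fit().
scaled_classifier = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=200, random_state=1),
)
scaled_classifier.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver actually used.
n_iterations = scaled_classifier.named_steps["logisticregression"].n_iter_[0]
print("lbfgs iterations used:", n_iterations)
```

Fitting the pipeline on `X_train` and predicting on `X_test` would apply the training-set scaling to the test data automatically.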


# Make predictions

In [25]:
# Predict outcomes for test data set
y_pred = classifier.predict(X_test)
results = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results.head(50)

Unnamed: 0,Prediction,Actual
0,0,0
1,0,0
2,0,0
3,0,0
4,0,0
5,0,0
6,0,0
7,0,0
8,0,0
9,0,0


# Validate the model using the test data

In [26]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.9801992528019925
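Accuracy alone can look strong when one class is rare. The notebook does not report the class balance, so whether that applies here is an open question; the sketch below only illustrates the general effect on made-up labels, where a "model" that always predicts 0 still scores 98% accuracy.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels (assumption): 98 negatives and 2 positives.
y_true = [0] * 98 + [1] * 2
# A "model" that always predicts the majority class.
y_always_zero = [0] * 100

accuracy = accuracy_score(y_true, y_always_zero)
recall = recall_score(y_true, y_always_zero)  # share of real positives found
matrix = confusion_matrix(y_true, y_always_zero)

print("accuracy:", accuracy)
print("recall for class 1:", recall)
print(matrix)  # rows are actual classes, columns are predicted classes
```

Running `confusion_matrix(y_test, y_pred)` on the real predictions would show how many of the actual Costco locations the model finds.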


# OTHER MACHINE LEARNING MODELS

# Chi-Squared Model

In [35]:
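from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Number of features to keep; assumed to be 8, matching the output below
# (the cell that defines it is not shown here)
num_feats = 8

# chi2 requires non-negative inputs, so rescale every feature to [0, 1]
X_norm = MinMaxScaler().fit_transform(X)
chi_selector = SelectKBest(chi2, k=num_feats)
chi_selector.fit(X_norm, y)
chi_support = chi_selector.get_support()
chi_feature = X.loc[:, chi_support].columns.tolist()
print(str(len(chi_feature)), 'selected features')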

8 selected features
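The scaling step matters because `chi2` only accepts non-negative feature values. The following sketch shows the same pattern on made-up data with one informative feature and two noise features; the `*_toy` names and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
y_toy = rng.integers(0, 2, size=200)

# Three made-up features: the first tracks the label, the other two are
# noise. Raw values can be negative, which chi2 would reject.
X_toy = np.column_stack([
    y_toy * 2.0 + rng.normal(size=200),
    rng.normal(size=200),
    rng.normal(size=200),
])

X_scaled = MinMaxScaler().fit_transform(X_toy)  # every value now lies in [0, 1]
selector = SelectKBest(chi2, k=1).fit(X_scaled, y_toy)

print("chi2 scores:", selector.scores_)
print("selected mask:", selector.get_support())
```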


# Pearson Correlation

In [36]:
def cor_selector(X, y, num_feats):
    cor_list = []
    feature_name = X.columns.tolist()
    # calculate the Pearson correlation with y for each feature
    for i in feature_name:
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
    # replace NaN (e.g. from constant columns) with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # names of the num_feats features with the largest absolute correlation
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    # boolean mask: True for selected features, False otherwise
    cor_support = [i in cor_feature for i in feature_name]
    return cor_support, cor_feature
cor_support, cor_feature = cor_selector(X, y, num_feats)
print(str(len(cor_feature)), 'selected features')

8 selected features
