# Predicting the opening of Costco stores based on hearing aid accessibility

We use logistic regression, a statistical method for predicting binary outcomes from data. In this case the two outcomes are "yes, there is a Costco here" and "no, there is no Costco here".

These two categories are encoded as 1 and 0, and the model estimates the probability that a sample belongs to class 1.

Logistic regression predicts binary outcomes, meaning that there are only two possible outcomes. For this analysis, we look for the features in our data set that seem to predict the presence of a Costco and use them to build the model. Multiple variables have to be taken into consideration, such as location, demographics, education, and income. These are assessed to arrive at one of two answers: yes, there is a Costco at this location, or no, there is not. In other words, the logistic regression model analyzes the available data and, when presented with a new sample, mathematically determines its probability of belonging to a class. We then use the testing set to see how well the model predicts which locations do or do not have a Costco. If the probability is above a certain cutoff point, say 70%, the sample is assigned to that class. If the probability is below the cutoff point, the sample is assigned to the other class.
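The cutoff idea can be sketched as follows. This is a minimal illustration on synthetic data, not the project's data: `X_demo`, `y_demo`, `demo_model`, and the 0.70 cutoff are all assumptions made for the example. For a binary problem, scikit-learn's `predict()` corresponds to a 0.5 cutoff, so a custom cutoff has to be applied to the output of `predict_proba()`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 200 samples, 4 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=1)

demo_model = LogisticRegression(max_iter=200, random_state=1).fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1
prob_class_1 = demo_model.predict_proba(X_demo)[:, 1]

# Assign class 1 only when the probability is above the chosen cutoff
cutoff = 0.70
pred_at_cutoff = (prob_class_1 > cutoff).astype(int)

# predict() corresponds to a 0.5 cutoff, so the stricter cutoff cannot produce more 1s
pred_default = demo_model.predict(X_demo)
print(pred_at_cutoff.sum(), pred_default.sum())
```

Raising the cutoff makes the model more conservative about answering "yes"; the right value depends on how costly a false "yes" is compared with a false "no".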

Let's summarize the steps we take to use a logistic regression model: create a model with LogisticRegression(), train the model with classifier.fit(), make predictions with classifier.predict(), and validate the model with accuracy_score().
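Because the cells below depend on files in an S3 bucket, here is a compact sketch of the same four steps on synthetic data so that the workflow can be tried end to end. The data from `make_classification` and the names `X_toy`, `y_toy`, and `toy_model` are assumptions for illustration only; the model settings mirror the ones used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the joined demographic/Costco data (assumption)
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1, stratify=y_toy)

# 1. Create the model
toy_model = LogisticRegression(solver="lbfgs", max_iter=200, random_state=1)

# 2. Train the model on the training split
toy_model.fit(X_tr, y_tr)

# 3. Make predictions for the held-out test split
y_hat = toy_model.predict(X_te)

# 4. Validate: fraction of test samples predicted correctly
toy_accuracy = accuracy_score(y_te, y_hat)
print(toy_accuracy)
```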



In [2]:
# Import libraries

from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
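# Read in data from S3 buckets - the "database" system we are using
from pyspark import SparkFiles
from pyspark.sql import SparkSession

# Reuse the notebook's active Spark session if there is one, otherwise start one
spark = SparkSession.builder.getOrCreate()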

# Cleaned ACS Data
url="https://<bucket name>.s3.amazonaws.com/cleaned_acs_demographics.csv" 
spark.sparkContext.addFile(url)
cleaned_acs_demographics_df = spark.read.csv(SparkFiles.get("cleaned_acs_demographics.csv"), sep=",", header=True, inferSchema=True)

# Show DataFrame
cleaned_acs_demographics_df.show()

In [None]:
# Cleaned Costco Data
url="https://<bucket name>.s3.amazonaws.com/cleaned_costco.csv" 
spark.sparkContext.addFile(url)
cleaned_costco_df = spark.read.csv(SparkFiles.get("cleaned_costco.csv"), sep=",", header=True, inferSchema=True)

# Show DataFrame
cleaned_costco_df.show()


In [None]:
# Join the two DataFrames on ZIP code
joined_df = cleaned_acs_demographics_df.join(cleaned_costco_df, on="zip code", how="inner")
joined_df.show()
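Whether an inner join is the right choice depends on what cleaned_costco.csv contains, which this notebook does not show. If that file lists only ZIP codes that have a store, an inner join keeps only the "yes" rows and leaves the model no "no" examples to learn from. The sketch below shows one possible alternative in pandas: a left join followed by a 0/1 label. The tables, the `median_income` and `store_count` columns, and the `has_costco` name are all hypothetical.

```python
import pandas as pd

# Hypothetical miniature tables; column names and values are made up for illustration
demographics = pd.DataFrame({
    "zip code": ["00001", "00002", "00003"],
    "median_income": [52000, 87000, 64000],
})
costco = pd.DataFrame({"zip code": ["00002"], "store_count": [1]})

# A left join keeps every ZIP code from the demographics table
labeled = demographics.merge(costco, on="zip code", how="left")

# ZIP codes with no matching Costco row get NaN, which becomes class 0
labeled["has_costco"] = labeled["store_count"].notna().astype(int)
print(labeled[["zip code", "has_costco"]])
```

If the Costco file already carries one row per ZIP code with its own yes/no column, the inner join above is fine as written.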

# Separate the Features (X) from the Target (y)

In [None]:
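df = joined_df.toPandas()  # scikit-learn needs pandas/NumPy data, not a Spark DataFrame
# "<target column>" is a placeholder for the 0/1 column marking whether a location has a Costco
y = df["<target column>"]
X = df.drop(columns="<target column>")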

# Split our data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    random_state=1, 
                                                    stratify=y)
X_train.shape

# Create a Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver='lbfgs',
                                max_iter=200,
                                random_state=1)


# Fit (train) our model using the training data

In [None]:
# Train the model on the training data
classifier.fit(X_train, y_train)

# Make predictions

In [None]:
# Predict outcomes for test data set
y_pred = classifier.predict(X_test)
results = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results.head()

# Validate the model using the test data

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))