# Part I: Research Question

### A1. 
How much data will a customer use, in GB, per year? Is it possible to predict what the yearly bandwidth will be for the customer using several explanatory variables?
### A2. 
The goal is to accurately predict the amount of bandwidth a customer may use, which will help stakeholders make decisions about network resource allocation and customer bandwidth limits.

# Part II: Method Justification

### B1: The multiple regression model makes five assumptions:
   - There is a linear relationship between each explanatory variable and the response variable.
   - None of the explanatory variables are highly correlated with each other.
   - The observations are independent.
   - The residuals have constant variance at every point in the linear model.
   - The residuals of the model are normally distributed.

(Zach, 2021)
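
The second assumption (no strong correlation among explanatory variables) can be checked numerically with variance inflation factors (VIF). The sketch below computes VIF by hand with NumPy so the formula stays visible. It runs on synthetic data, not the churn data, and the helper name `variance_inflation_factors` and the variables `x1`, `x2`, `x3` are assumptions made for illustration. A commonly cited rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r_squared = 1.0 - resid.var() / y.var()
        # VIF_j = 1 / (1 - R^2_j)
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data (hypothetical): x3 is almost a copy of x1, x2 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + rng.normal(scale=0.1, size=500)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs.round(1))
```

In this synthetic setup, `x1` and `x3` should receive large VIF values while `x2` stays close to 1, which is the pattern to look for among the real explanatory variables.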

### B2:
I've chosen to use Python for the multiple regression analysis. Python has a large following and is heavily used in academic and industrial circles, which means that plenty of useful analytics libraries are available, such as statsmodels and seaborn (Terra, 2021).
### B3: 
Multiple regression is an appropriate technique for the research question because the response variable, the bandwidth used in GB per year, is continuous, and several explanatory variables, such as children, tenure, income, or age, may help predict how much bandwidth a customer will use in a year.

# Part III: Data Preparation

### C1: Data preparation and manipulations:
   - Import the relevant Python libraries I plan on using in this notebook
   - Load the raw churn data into a Pandas dataframe so it can be read and manipulated appropriately
   - Rename columns whose names do not describe the data they hold, such as the survey questions labeled Item1 through Item8

In [None]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import statsmodels.api as sm

# Raw Data to DataFrame
df = pd.read_csv('churn_clean.csv', encoding='utf-8', index_col=0)

# Rename survey columns for better readability
df.rename(columns={
    'Item1': 'Timely_response',
    'Item2': 'Timely_fixes',
    'Item3': 'Timely_replacements',
    'Item4': 'Reliability',
    'Item5': 'Options',
    'Item6': 'Respectful_response',
    'Item7': 'Courteous_exchange',
    'Item8': 'Active_listening'
}, inplace=True)
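
One caveat with `rename`: by default pandas ignores mapping keys that do not match any column, so a misspelled key (for example `'Itme8'` instead of `'Item8'`) leaves that column unrenamed without any warning. The sketch below uses a tiny made-up frame named `demo` (an assumption for illustration, not the churn data) to show how passing `errors='raise'` turns such a typo into a `KeyError`.

```python
import pandas as pd

# Tiny stand-in frame with two of the survey columns (values are made up).
demo = pd.DataFrame({"Item1": [3, 4], "Item8": [2, 5]})

# Default behaviour: the misspelled key 'Itme8' is silently ignored.
silent = demo.rename(columns={"Item1": "Timely_response",
                              "Itme8": "Active_listening"})
print(list(silent.columns))

# With errors='raise', the same typo surfaces as a KeyError.
try:
    demo.rename(columns={"Itme8": "Active_listening"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)
```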

### C2: Summary Statistics

The churn data contains 10,000 customers and 50 columns/variables. For the multiple regression analysis, I've selected **Bandwidth_GB_Year** as the **response** variable and the following 16 variables as possible **explanatory** variables:

- Children
- Age
- Income
- Marital
- Gender
- Outage_sec_perweek
- Contacts
- Yearly_equip_failure
- Techie
- Contract
- Tenure
- MonthlyCharge
- Timely_response
- Timely_fixes
- Timely_replacements
- Reliability

Variables omitted from the analysis are unique customer identifiers, customer location data, and other variables I felt did not logically contribute to the research question such as particular survey questions.

**Numeric variables** (continuous measures and discrete counts):

- Children
- Age
- Income
- Outage_sec_perweek
- Contacts
- Yearly_equip_failure
- Tenure
- MonthlyCharge

**Categorical variables**:

- Marital
- Gender
- Techie
- Contract
- Timely_response
- Timely_fixes
- Timely_replacements
- Reliability
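
Categorical variables have to be converted to numbers before they can enter a regression model. One common option is dummy (one-hot) encoding with `pd.get_dummies`. The sketch below is only an illustration: the frame `sample` and its category labels are made up, and `drop_first=True` is one possible choice that drops a reference level per variable so the dummy columns are not perfectly collinear with the intercept.

```python
import pandas as pd

# Made-up rows using two of the categorical columns listed above.
sample = pd.DataFrame({
    "Techie": ["Yes", "No", "No"],
    "Contract": ["Month-to-month", "One year", "Two Year"],
    "Tenure": [6.8, 1.2, 15.7],
})

# One 0/1 column per category level, minus one reference level per variable.
encoded = pd.get_dummies(
    sample,
    columns=["Techie", "Contract"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
print(encoded)
```

The survey responses are ordinal scores, so whether to dummy-encode them or keep them as numeric scores is a separate modelling decision that this sketch does not settle.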


In [23]:
df.describe()

| | Zip | Lat | Lng | Population | Children | Age | Income | Outage_sec_perweek | Email | Contacts | ... | MonthlyCharge | Bandwidth_GB_Year | Timely_response | Timely_fixes | Timely_replacements | Reliability | Options | Respectful_response | Courteous_exchange | Item8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | ... | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 49153.3196 | 38.757567 | -90.782536 | 9756.5624 | 2.0877 | 53.0784 | 39806.926771 | 10.001848 | 12.016 | 0.9942 | ... | 172.624816 | 3392.34155 | 3.4908 | 3.5051 | 3.487 | 3.4975 | 3.4929 | 3.4973 | 3.5095 | 3.4956 |
| std | 27532.196108 | 5.437389 | 15.156142 | 14432.698671 | 2.1472 | 20.698882 | 28199.916702 | 2.976019 | 3.025898 | 0.988466 | ... | 42.943094 | 2185.294852 | 1.037797 | 1.034641 | 1.027977 | 1.025816 | 1.024819 | 1.033586 | 1.028502 | 1.028633 |
| min | 601.0 | 17.96612 | -171.68815 | 0.0 | 0.0 | 18.0 | 348.67 | 0.099747 | 1.0 | 0.0 | ... | 79.97886 | 155.506715 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 26292.5 | 35.341828 | -97.082813 | 738.0 | 0.0 | 35.0 | 19224.7175 | 8.018214 | 10.0 | 0.0 | ... | 139.979239 | 1236.470827 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| 50% | 48869.5 | 39.3958 | -87.9188 | 2910.5 | 1.0 | 53.0 | 33170.605 | 10.01856 | 12.0 | 1.0 | ... | 167.4847 | 3279.536903 | 3.0 | 4.0 | 3.0 | 3.0 | 3.0 | 3.0 | 4.0 | 3.0 |
| 75% | 71866.5 | 42.106908 | -80.088745 | 13168.0 | 3.0 | 71.0 | 53246.17 | 11.969485 | 14.0 | 2.0 | ... | 200.734725 | 5586.141369 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| max | 99929.0 | 70.64066 | -65.66785 | 111850.0 | 10.0 | 89.0 | 258900.7 | 21.20723 | 23.0 | 7.0 | ... | 290.160419 | 7158.98153 | 7.0 | 7.0 | 8.0 | 7.0 | 7.0 | 8.0 | 7.0 | 8.0 |
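
`df.describe()` summarizes only numeric columns by default, and the wide output above elides the middle columns. A possible way to focus the summary on the variables selected for the analysis is to subset the numeric columns before calling `describe()` and to use `value_counts()` for each categorical column. The sketch below uses a small made-up frame named `demo` (an assumption for illustration); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Small made-up frame standing in for the churn data.
demo = pd.DataFrame({
    "Tenure": [6.8, 1.2, 15.7, 17.1],
    "MonthlyCharge": [172.5, 242.6, 159.9, 119.9],
    "Contract": ["One year", "Month-to-month", "Two Year", "Two Year"],
})

# Summary statistics for the selected numeric columns only.
numeric_summary = demo[["Tenure", "MonthlyCharge"]].describe()
print(numeric_summary)

# Frequency table for a categorical column.
contract_counts = demo["Contract"].value_counts()
print(contract_counts)
```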


# I. Sources

Zach. (2021, November 16). _The five assumptions of multiple linear regression._ Statology. Retrieved January 9, 2022, from https://www.statology.org/multiple-linear-regression-assumptions/ 

Terra, J. (2021, July 22). _Python for data science and data analysis._ Simplilearn.com. Retrieved January 9, 2022, from https://www.simplilearn.com/why-python-is-essential-for-data-analysis-article#why_is_python_essential_for_data_analysis 