# STA130 HW 7 pre lecture

### chatbot session link: https://chatgpt.com/share/67317097-d690-800e-bfb7-39ccfec1d33f
### chatbot summary:

---

**Assignment Summary: Exploring Linear Regression Techniques and Data Visualization in Python**

**Objective**  
The goal of this assignment is to understand the fundamental concepts of linear regression, specifically the differences between simple and multiple linear regression, how to interpret and apply interaction terms, and how to implement regression analysis and plot a linear equation using Python. 

**1. Simple vs. Multiple Linear Regression**  
We started by discussing the distinction between simple linear regression and multiple linear regression:
   - **Simple Linear Regression** models the relationship between one dependent variable and one independent variable.
   - **Multiple Linear Regression** incorporates multiple independent variables, allowing a more complex and realistic representation of real-world scenarios.
   - **Benefits of Multiple Linear Regression** include improved predictive power, realistic modeling, and the ability to analyze the effect of several variables simultaneously.

**2. Interaction Terms in Multiple Linear Regression**  
Next, we covered adding interaction terms to a multiple linear regression model:
   - Interaction terms allow us to examine how the effect of one predictor on the dependent variable might vary depending on the value of another predictor.
   - In Python, you can add an interaction term in `statsmodels` using the formula syntax: `y ~ x1 + x2 + x1:x2`.

**3. Conducting Multiple Linear Regression Using `statsmodels` in Python**  
We reviewed the steps for performing multiple linear regression using `statsmodels.formula.api.ols`:
   - Define the model formula, specify the dataset, fit the model, and view a summary of the results.
   - The summary output provides valuable statistical metrics, including coefficients, p-values, and R-squared values.

**4. Troubleshooting `ValueError` in Linear Regression Models**  
While implementing multiple linear regression, a `ValueError` was encountered due to a misinterpretation of the dependent variable. Solutions included:
   - Ensuring all variables are numeric.
   - Checking for missing or invalid values.
   - Confirming column names match the formula exactly.

**5. Plotting a Line Given an Equation Using Plotly Express**  
Finally, we explored plotting a line given its equation using Plotly Express:
   - Define the line equation (e.g., `y = m * x + b`) and generate x-y pairs.
   - Use `plotly.express.line` to visualize the line, creating an interactive and customizable plot.

**Conclusion**  
This assignment introduced key concepts in linear regression, including modeling with multiple variables, interpreting interaction effects, and visualizing linear relationships. Through these exercises, we gained practical experience using Python’s `statsmodels` and `plotly` libraries to perform regression analysis and generate visualizations, essential skills for data science and statistical modeling.

--- 


### 1. Explain succinctly in your own words (but working with a ChatBot if needed)...

#### 1. the difference between Simple Linear Regression and Multiple Linear Regression; and the benefit the latter provides over the former

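Simple linear regression models the outcome using a single predictor, while multiple linear regression models it using two or more predictors.

Because it can account for several predictors at once, multiple linear regression can describe the data more realistically and often gives more accurate predictions of the outcome.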

#### 2. the difference between using a continuous variable and an indicator variable in Simple Linear Regression; and these two linear forms

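An indicator variable encodes binary (two-group) data as 0 or 1, so the model contrasts the means of the two groups. A continuous variable can take any numeric value, so the model describes a line.

When using a continuous predictor, the form is as follows:

$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma)$

When using an indicator:

$Y_i = \beta_0 + \beta_1 1_{["some\ group"]}(k_i) + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma)$

Here $\beta_0$ is the mean of the baseline group and $\beta_1$ is the difference between the mean of "some group" and the baseline mean.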

#### 3. the change that happens in the behavior of the model (i.e., the expected nature of the data it models) when a single indicator variable is introduced alongside a continuous variable to create a Multiple Linear Regression; and these two linear forms (i.e., the Simple Linear Regression versus the Multiple Linear Regression)



If an indicator variable is added alongside the continuous predictor, the model no longer describes a single line but two parallel lines, one per group: the slope is the same for both groups while the intercepts differ by $\beta_2$.

The simple linear regression:

$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma)$

The regression model after adding the indicator (multiple linear regression):

$Y_i = \beta_0 + \beta_1 x_i + \beta_2 1_{["some\ group"]}(k_i) + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma)$
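As a minimal sketch of this parallel-lines behavior, the block below fits `y ~ x + group` to synthetic data and prints the line implied for each group. The data, the column names `x`, `group` and `y`, and the true coefficients (1, 2, 3) are assumptions made for illustration; they do not come from the course dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumed for illustration, not the course dataset).
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
group = rng.integers(0, 2, size=n)  # 0/1 indicator for "some group"
y = 1.0 + 2.0 * x + 3.0 * group + rng.normal(0, 0.5, size=n)
demo = pd.DataFrame({"y": y, "x": x, "group": group})

# Continuous predictor plus indicator, no interaction.
fit = smf.ols("y ~ x + group", data=demo).fit()
b0 = fit.params["Intercept"]
b1 = fit.params["x"]
b2 = fit.params["group"]

# Both groups share the slope b1; only the intercept shifts by b2.
print(f"group 0: y = {b0:.2f} + {b1:.2f} * x")
print(f"group 1: y = {b0 + b2:.2f} + {b1:.2f} * x")
```

The two printed lines have the same slope, so plotting them would show two parallel lines separated vertically by the indicator's coefficient.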

#### 4. the effect of adding an interaction between a continuous and an indicator variable in Multiple Linear Regression models; and this linear form

An interaction lets the relationship between the outcome and the continuous predictor differ between the groups: the slope is $\beta_1$ for the baseline group and $\beta_1 + \beta_3$ for the indicated group, so the two lines are no longer parallel.

$Y_i = \beta_0 + \beta_1 x_i + \beta_2 1_{["some\ group"]}(k_i) + \beta_3 x_i \times 1_{["some\ group"]}(k_i) + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma)$
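The effect of the interaction can be sketched with the `statsmodels` formula syntax `x:group`, which adds the product of the continuous predictor and the indicator. The synthetic data, the column names, and the true coefficients below are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumed): group 1 has a steeper slope than group 0.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
group = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * x + 3.0 * group + 1.5 * x * group + rng.normal(0, 0.5, size=n)
demo = pd.DataFrame({"y": y, "x": x, "group": group})

# "x:group" adds the product term; "y ~ x * group" is shorthand for the same model.
fit = smf.ols("y ~ x + group + x:group", data=demo).fit()

# The slope for the baseline group is the coefficient of x;
# the interaction coefficient is the extra slope for group 1.
slope_group0 = fit.params["x"]
slope_group1 = fit.params["x"] + fit.params["x:group"]
print(f"slope for group 0: {slope_group0:.2f}")
print(f"slope for group 1: {slope_group1:.2f}")
```

Because the two slopes differ, the fitted lines for the two groups are no longer parallel.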

#### 5. the behavior of a Multiple Linear Regression model (i.e., the expected nature of the data it models) based only on indicator variables derived from a non-binary categorical variable; this linear form; and the necessarily resulting binary variable encodings it utilizes

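If the model is based only on indicator variables, the data it describes is a set of group means: plotted against the category, the predictions are points at different heights, one per group.

$Y_i = \beta_0 + \beta_1 1_{["group\ 2"]}(k_i) + \beta_2 1_{["group\ 3"]}(k_i) + \dots + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma)$

A categorical variable with $K$ levels is encoded with $K - 1$ binary indicators; the level without an indicator is the baseline group. $\beta_0$ is the mean of the baseline group, and every other coefficient is the difference between its group's mean and the baseline mean.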
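A small sketch of the encoding, using a made-up three-level variable (the data and the level names `A`, `B`, `C` are hypothetical). `pd.get_dummies(..., drop_first=True)` shows the indicator columns, and `C(group)` asks the `statsmodels` formula interface to build the same kind of encoding.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Small made-up dataset with a three-level categorical variable.
demo = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0],
})

# K = 3 levels become K - 1 = 2 indicator columns; the dropped level "A" is the baseline.
indicators = pd.get_dummies(demo["group"], drop_first=True)
print(indicators.columns.tolist())

# Fit a model that uses only the indicators derived from the category.
fit = smf.ols("y ~ C(group)", data=demo).fit()
group_means = demo.groupby("group")["y"].mean()

# The intercept matches the baseline mean; the other coefficients are
# differences between each group's mean and the baseline mean.
print(fit.params)
print(group_means)
```

Comparing the two printouts shows that the model's predictions are simply the group means.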

### 2. Explain in your own words (but working with a ChatBot if needed) what the specific (outcome and predictor) variables are for the scenario below; whether or not any meaningful interactions might need to be taken into account when predicting the outcome; and provide the linear forms with and without the potential interactions that might need to be considered

`Imagine a company that sells sports equipment. The company runs advertising campaigns on TV and online platforms. The effectiveness of the TV ad might depend on the amount spent on online advertising and vice versa, leading to an interaction effect between the two advertising mediums.`

Variables: the outcome is the effectiveness of the advertising (for example, sales of sports equipment); the predictors are the amount spent on TV ads and the amount spent on online ads.

A meaningful interaction is the one between the two predictors: the effect of TV spending on the outcome may depend on the amount spent on online ads, and vice versa.

Denote $Y_i$ as the outcome, $x_{[TV]i}$ as the amount spent on TV ads, and $x_{[online]i}$ as the amount spent on online ads.

Without interactions:

$Y_i = \beta_0 + \beta_1 x_{[TV]i} + \beta_2 x_{[online]i} + \epsilon_i$

With interactions:

$Y_i = \beta_0 + \beta_1 x_{[TV]i} + \beta_2 x_{[online]i} + \beta_3 x_{[TV]i} x_{[online]i} + \epsilon_i$

#### 1. Explain how to use these two formulas to make predictions of the outcome, and give a high level explanation in general terms of the difference between predictions from the models with and without the interaction

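To fit the models, estimate the coefficients from the observed data (for example, by ordinary least squares). To make predictions, plug the values of the predictors into the fitted formula.

The model without the interaction assumes that the effect of each advertising medium is the same regardless of the spending on the other, so its predictions will be less accurate if an interaction really exists. The model with the interaction lets the effect of one medium change with the spending on the other, so its predictions are expected to be more accurate in that case.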
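The arithmetic can be sketched as follows. The coefficient values and the function names are hypothetical, chosen only to show how a model with a TV × online product term differs from one without it; they are not fitted to any data.

```python
# Hypothetical fitted coefficients, chosen only to illustrate the arithmetic.
B0, B_TV, B_ONLINE, B_INTERACTION = 10.0, 2.0, 3.0, 0.5


def predict_additive(tv, online):
    """Prediction from the model without the interaction term."""
    return B0 + B_TV * tv + B_ONLINE * online


def predict_interaction(tv, online):
    """Prediction from the model with the TV x online product term."""
    return predict_additive(tv, online) + B_INTERACTION * tv * online


def tv_effect(predict, online, tv=4.0):
    """Change in the prediction when TV spending rises by one unit."""
    return predict(tv + 1, online) - predict(tv, online)


# Without the interaction the effect of TV spending is constant;
# with it, the effect grows with the online spending.
for online in (0.0, 6.0):
    print(online, tv_effect(predict_additive, online), tv_effect(predict_interaction, online))
```

In this sketch one extra unit of TV spending always adds `B_TV` in the additive model, while in the interaction model it adds `B_TV + B_INTERACTION * online`.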

#### 2.  Explain how to update and use the implied two formulas to make predictions of the outcome if, rather than considering two continuous predictor variables, we instead suppose the advertisement budgets are simply categorized as either "high" or "low" (binary variables)

To update the model, replace each continuous predictor with an indicator variable that is 1 when the budget is "high" and 0 when it is "low", then fit the model again.

To make predictions, set each indicator to 0 or 1 according to the budget category and substitute it into the fitted formula.
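With binary budgets there are only four possible predictions, one per combination of "low" and "high". A minimal sketch, again with hypothetical coefficients and a hypothetical `predict` helper:

```python
from itertools import product

# Hypothetical coefficients for the indicator version of the model.
B0, B_TV_HIGH, B_ONLINE_HIGH, B_BOTH_HIGH = 20.0, 5.0, 8.0, 4.0


def predict(tv_budget, online_budget, interaction=True):
    """Predict the outcome from 'high'/'low' budget labels."""
    tv = 1 if tv_budget == "high" else 0
    online = 1 if online_budget == "high" else 0
    y = B0 + B_TV_HIGH * tv + B_ONLINE_HIGH * online
    if interaction:
        # The product of the indicators is 1 only when both budgets are high.
        y += B_BOTH_HIGH * tv * online
    return y


for tv_budget, online_budget in product(("low", "high"), repeat=2):
    without = predict(tv_budget, online_budget, interaction=False)
    with_term = predict(tv_budget, online_budget, interaction=True)
    print(tv_budget, online_budget, without, with_term)
```

The two models agree except when both budgets are "high", which is the only case where the interaction term is non-zero.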

### 3. Use smf to fit multiple linear regression models to the course project dataset from the Canadian Social Connection Survey

In [2]:
import pandas as pd
import plotly.express as px  # used for the scatter plots further down
import statsmodels.formula.api as smf

url = "https://raw.githubusercontent.com/pointOfive/stat130chat130/refs/heads/main/CP/CSCS_data_anon.csv"
df = pd.read_csv(url).fillna('None')



In [13]:
original_data = df.loc[:, "COVID_prevention_distancing":"COVID_vaccinated"]
original_data

| | COVID_prevention_distancing | COVID_prevention_masks | COVID_prevention_hand_washing | COVID_prevention_reduce_people | COVID_prevention_avoid_trips | COVID_prevention_household | COVID_vaccinated |
|---|---|---|---|---|---|---|---|
| 0 | Somewhat closely | Somewhat closely | Somewhat closely | Not at all | Not at all | Not at all | Yes, three or more doses |
| 1 | Very closely | Not at all | Very closely | Very closely | Very closely | Very closely | Yes, three or more doses |
| 2 | Somewhat closely | Not at all | Somewhat closely | Not at all | Not at all | Somewhat closely | Yes, three or more doses |
| 3 | Not at all | Not at all | Somewhat closely | Not at all | Not at all | Not at all | Yes, three or more doses |
| 4 | Very closely | Very closely | Very closely | Very closely | Very closely | Somewhat closely | Yes, one dose |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 11426 | Somewhat closely | Not at all | Somewhat closely | Somewhat closely | Very closely | Not at all | Yes, three or more doses |
| 11427 | Not at all | Not at all | Very closely | Not at all | Not at all | Not at all | Yes, three or more doses |
| 11428 | Somewhat closely | Somewhat closely | Very closely | Somewhat closely | Somewhat closely | Somewhat closely | Yes, two doses |
| 11429 | Not at all | Not at all | Very closely | Somewhat closely | Not at all | Not at all | Yes, two doses |


In [28]:
map_rule = {
    "Very closely":3,
    "Somewhat closely":2,
    "Not at all":1,
    "Yes, three or more doses":3,
    "Yes, two doses":2,
    "Yes, one dose":1,
    "No":0
}
# Encode the ordinal survey responses as numbers using map_rule.
data = original_data.replace(map_rule)
cols = ["COVID_prevention_distancing", "COVID_prevention_masks", "COVID_vaccinated"]
data_copy = data.loc[:, cols].copy()

# Any response not covered by map_rule (including the 'None' fill value) becomes NaN.
for col in cols:
    data_copy[col] = pd.to_numeric(data_copy[col], errors='coerce')
data_copy.dtypes

COVID_prevention_distancing    float64
COVID_prevention_masks         float64
COVID_vaccinated               float64
dtype: object

In [30]:
s = "COVID_vaccinated ~ COVID_prevention_distancing + COVID_prevention_masks"
#  + COVID_prevention_hand_washing + COVID_prevention_reduce_people + COVID_prevention_avoid_trips + COVID_prevention_household
model = smf.ols(formula=s, data=data_copy).fit()

result = model.summary()
result

| | | | |
|---|---|---|---|
| Dep. Variable: | COVID_vaccinated | R-squared: | 0.042 |
| Model: | OLS | Adj. R-squared: | 0.042 |
| Method: | Least Squares | F-statistic: | 165.7 |
| Date: | Mon, 11 Nov 2024 | Prob (F-statistic): | 3.65e-71 |
| Time: | 01:52:11 | Log-Likelihood: | -10920.0 |
| No. Observations: | 7510 | AIC: | 21850.0 |
| Df Residuals: | 7507 | BIC: | 21870.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.5175 | 0.041 | 61.988 | 0.000 | 2.438 | 2.597 |
| COVID_prevention_distancing | -0.0985 | 0.021 | -4.734 | 0.000 | -0.139 | -0.058 |
| COVID_prevention_masks | -0.2017 | 0.019 | -10.888 | 0.000 | -0.238 | -0.165 |

| | | | |
|---|---|---|---|
| Omnibus: | 1063.65 | Durbin-Watson: | 2.007 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 444.783 |
| Skew: | -0.41 | Prob(JB): | 2.61e-97 |
| Kurtosis: | 2.134 | Cond. No. | 12.3 |


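The model estimates the linear relationship between COVID_vaccinated and two predictors, COVID_prevention_distancing and COVID_prevention_masks. Both predictors are ordinal categories that were encoded as the numbers 1 to 3 and are treated as continuous. To predict, plug the encoded values of both predictors into the fitted equation.

Since the p-value for each coefficient is below 0.001, we have strong evidence against the null hypothesis that there is no linear association between COVID_vaccinated and each predictor. The R-squared of 0.042, however, shows that the model explains only about 4.2% of the variability in COVID_vaccinated.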

In [41]:
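# Plot the distancing part of the fitted equation on its own.
# Coefficients are copied from the summary table; the masks term is left out (as if it were 0).
l1 = pd.DataFrame(data_copy["COVID_prevention_distancing"].copy())
l1['predicted'] = 2.5175 + (-0.0985) * data_copy["COVID_prevention_distancing"]

fig = px.scatter(l1, x="COVID_prevention_distancing", y="predicted")
fig.show()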

In [43]:
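# Plot the masks part of the fitted equation on its own.
# Coefficients are copied from the summary table; the distancing term is left out (as if it were 0).
l2 = pd.DataFrame(data_copy["COVID_prevention_masks"].copy())
l2['predicted'] = 2.5175 + (-0.2017) * data_copy["COVID_prevention_masks"]

fig = px.scatter(l2, x="COVID_prevention_masks", y="predicted")
fig.show()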
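Predictions from a multiple linear regression use all predictors at once. As a hedged sketch, a fitted `statsmodels` result exposes its coefficients as `params` and offers a `predict` method, so the coefficients do not have to be copied by hand. The data below is a synthetic stand-in with hypothetical column names (`distancing`, `masks`, `vaccinated`), not the survey data itself.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data on a 1-3 scale (assumed for illustration).
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    "distancing": rng.integers(1, 4, size=n),
    "masks": rng.integers(1, 4, size=n),
})
demo["vaccinated"] = (
    2.5 - 0.1 * demo["distancing"] - 0.2 * demo["masks"] + rng.normal(0, 1, size=n)
)

fit = smf.ols("vaccinated ~ distancing + masks", data=demo).fit()

# New observations to predict for: both predictors must be supplied.
new_rows = pd.DataFrame({"distancing": [1, 3], "masks": [2, 2]})
predicted = fit.predict(new_rows)

# The same predictions computed by hand from the fitted coefficients.
manual = (
    fit.params["Intercept"]
    + fit.params["distancing"] * new_rows["distancing"]
    + fit.params["masks"] * new_rows["masks"]
)
print(predicted)
print(manual)
```

The two printouts agree, which shows that `predict` is just the fitted equation evaluated with every predictor plugged in.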

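### 4. Explain the apparent contradiction between the factual statements regarding the fit below that "the model only explains 17.6% of the variability in the data" while at the same time "many of the coefficients are larger than 10 while having strong or very strong evidence against the null hypothesis of 'no effect'"

"the model only explains 17.6% of the variability in the data": this statement is about R-squared, the proportion of the variability in the outcome that the model's predictions account for, so it describes how accurately the model predicts individual observations.

"many of the coefficients are larger than 10 while having strong or very strong evidence against the null hypothesis of 'no effect'":
this statement is about hypothesis tests on the individual coefficients. A small p-value means there is strong evidence that the average change in the outcome associated with a predictor is not zero, and a coefficient larger than 10 means that this estimated average change is large.

The two statements do not contradict each other because they answer different questions: predictors can have large, clearly non-zero average effects while individual observations still vary widely around the model's predictions.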
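A small simulation can make this concrete. The sketch below uses synthetic data (an assumption, unrelated to the fit discussed in the question) in which a group indicator shifts the outcome by about 12 on average, but the individual noise is large.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumed): a large average group effect buried in large individual noise.
rng = np.random.default_rng(0)
n = 1000
group = rng.integers(0, 2, size=n)
y = 50.0 + 12.0 * group + rng.normal(0, 15.0, size=n)
demo = pd.DataFrame({"y": y, "group": group})

fit = smf.ols("y ~ group", data=demo).fit()

# Large coefficient and tiny p-value, yet a low R-squared.
print(f"coefficient: {fit.params['group']:.2f}")
print(f"p-value:     {fit.pvalues['group']:.2e}")
print(f"R-squared:   {fit.rsquared:.3f}")
```

With many observations the average effect is estimated precisely, so the evidence against "no effect" is strong, while the noise around that average keeps the proportion of variability explained low.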