
## 6.4 Linear Regression Analysis — Gun Violence Data

### Hypothesis
We hypothesize that the number of people injured (`n_injured`) or the number of victims (`n_victims`) in a gun violence incident is a linear predictor of the number of people killed (`n_killed`).


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
%matplotlib inline


In [None]:

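# Load the cleaned dataset and keep only rows where both columns are present.
df = pd.read_csv("cleaned_gunviolence.csv")
df_model = df[['n_killed', 'n_injured']].dropna()
# scikit-learn expects a 2-D feature matrix, hence the reshape.
X = df_model['n_injured'].values.reshape(-1, 1)
y = df_model['n_killed'].values.reshape(-1, 1)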


In [None]:

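# Model 1: hold out 20% of the rows for evaluation, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit ordinary least squares on the training split.
model = LinearRegression()
model.fit(X_train, y_train)
# Score the model on the held-out test split.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)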

plt.figure(figsize=(6, 4))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted Line')
plt.xlabel('Number Injured')
plt.ylabel('Number Killed')
plt.title('Linear Regression: Injured vs. Killed')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

print("MSE:", mse)
print("R²:", r2)


In [None]:

# Model 2: same procedure, with n_victims as the single predictor.
df_model2 = df[['n_killed', 'n_victims']].dropna()
X2 = df_model2['n_victims'].values.reshape(-1, 1)
y2 = df_model2['n_killed'].values.reshape(-1, 1)
# Same 80/20 split and seed as Model 1.
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42)
model2 = LinearRegression()
model2.fit(X2_train, y2_train)
# Score the model on the held-out test split.
y2_pred = model2.predict(X2_test)
mse2 = mean_squared_error(y2_test, y2_pred)
r2_2 = r2_score(y2_test, y2_pred)

plt.figure(figsize=(6, 4))
plt.scatter(X2_test, y2_test, color='green', label='Actual')
plt.plot(X2_test, y2_pred, color='red', linewidth=2, label='Predicted Line')
plt.xlabel('Number of Victims')
plt.ylabel('Number Killed')
plt.title('Linear Regression: Victims vs. Killed')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

print("MSE:", mse2)
print("R²:", r2_2)



---

### Model 1: `n_injured` → `n_killed`
- **Mean Squared Error (MSE)**: 0.249
- **R² Score**: 0.027
- The model performs poorly, explaining only about 2.7% of the variance in `n_killed` on the held-out test set. The scatterplot shows wide dispersion around the fitted line, indicating a weak linear relationship.
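
To make the R² figure concrete, the following minimal sketch computes MSE and R² by hand and compares the fitted line with a baseline that always predicts the mean. The data are synthetic (a weak linear signal plus noise), not the gun violence dataset; `x`, `y`, and the coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): a weak linear signal buried in noise.
x = rng.poisson(lam=0.5, size=2000).astype(float)
y = 0.1 * x + rng.normal(scale=0.5, size=2000)

# Ordinary least squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# MSE of the fitted line and of the baseline that always predicts mean(y).
mse_model = np.mean((y - y_hat) ** 2)
mse_baseline = np.mean((y - y.mean()) ** 2)

# R^2 = 1 - SS_res / SS_tot, which equals 1 - mse_model / mse_baseline.
r2 = 1.0 - mse_model / mse_baseline

print(f"model MSE:    {mse_model:.3f}")
print(f"baseline MSE: {mse_baseline:.3f}")
print(f"R^2:          {r2:.3f}")
```

Read this way, a test-set R² of 0.027 means the fitted line lowers squared error by about 2.7% relative to always predicting the test-set mean of `n_killed`.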

---

### Model 2: `n_victims` → `n_killed`
- **Mean Squared Error (MSE)**: 0.230
- **R² Score**: 0.102
- Performance is somewhat better: the model explains about 10.2% of the variance on the test set. `n_victims` is a stronger predictor than `n_injured`, but the fit is still weak overall.

---

### Interpretation
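Neither linear regression model performs well. This may be due to:
- The count nature of the target: `n_killed` is discrete with many zeros, which a straight line with constant-variance errors describes poorly
- Skewed distributions and outliers, to which least squares is sensitive
- Complex causal factors that a single predictor does not capture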
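
The first two points can be checked directly before modelling. The sketch below uses synthetic zero-heavy counts as a stand-in for `n_killed` (an assumption, not the real data) and reports the share of zeros, the variance-to-mean ratio, and the sample skewness.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a zero-heavy count target (assumption): a negative
# binomial draw, which is overdispersed relative to a Poisson distribution.
counts = rng.negative_binomial(n=0.5, p=0.6, size=5000)

zero_share = np.mean(counts == 0)
mean = counts.mean()
var = counts.var(ddof=1)

# A Poisson variable has variance equal to its mean, so a ratio well
# above 1 points to overdispersion.
dispersion = var / mean

# Sample skewness: third central moment over the cubed standard deviation.
skewness = np.mean((counts - mean) ** 3) / counts.std() ** 3

print(f"share of zeros:         {zero_share:.2f}")
print(f"variance-to-mean ratio: {dispersion:.2f}")
print(f"skewness:               {skewness:.2f}")
```

To run the same check on the real data, replace `counts` with `df['n_killed'].to_numpy()`.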

---

### Reflections on Bias
- **Data sparsity**: Many events report zero fatalities, which skews the distribution of `n_killed`.
- **Omitted variables**: Contextual or situational variables (e.g., weapon type, emergency response time) are not included.
- **Reporting bias**: Incidents may not be reported uniformly.

---

### Conclusion
Simple linear regression is not sufficient to predict `n_killed` accurately from either predictor. Count-data models (Poisson or negative binomial regression) or tree-based models, together with better feature engineering, are recommended for future analysis.
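
As a starting point for that follow-up, here is a hedged sketch of a Poisson regression using scikit-learn's `PoissonRegressor`, scored with mean Poisson deviance alongside a linear baseline. The data are synthetic (counts whose log-mean is linear in one feature), and the coefficients `-1.5` and `0.4` are illustrative assumptions; this was not run on the gun violence dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data (assumption): counts whose log-mean is linear in x.
X = rng.poisson(lam=1.0, size=(3000, 1)).astype(float)
y = rng.poisson(lam=np.exp(-1.5 + 0.4 * X[:, 0]))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Poisson GLM with a log link; alpha=0 turns off the default L2 penalty.
poisson = PoissonRegressor(alpha=0.0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# Poisson deviance requires strictly positive predictions, so the linear
# model's output is clipped before scoring.
pred_poisson = poisson.predict(X_test)
pred_linear = np.clip(linear.predict(X_test), 1e-6, None)

dev_poisson = mean_poisson_deviance(y_test, pred_poisson)
dev_linear = mean_poisson_deviance(y_test, pred_linear)

print(f"Poisson coefficient:     {poisson.coef_[0]:.3f}")
print(f"Poisson model deviance:  {dev_poisson:.3f}")
print(f"Linear baseline deviance: {dev_linear:.3f}")
```

Because of the log link, the Poisson model's predictions are always positive, whereas a straight line can predict negative counts. If the real target turns out to be overdispersed, a negative binomial model (available in statsmodels, for example) may be a better fit than Poisson.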
