# Exercise 1

Because protest pageantry is complex visual data and the researcher wants to avoid overly simplistic models, a combination of supervised and unsupervised machine learning techniques is likely most relevant. Convolutional neural networks (CNNs), a form of deep learning, are well suited to extracting features from images. Since the captions describe each protest's topic, they can serve as labels in a supervised approach that trains a CNN to recognize visual cues associated with different protest themes.

The researcher might also want to discover *unforeseen* visual similarities across protests, even those with different stated aims. For this, unsupervised methods such as clustering (e.g., k-means) could be applied to the extracted image features to group protests by visual similarity, potentially revealing patterns that the captions alone would not suggest. The captions could then be used *post hoc* to interpret the themes of these visually derived clusters.

One ambiguity is how exactly the captions will be used: will they be the *sole* basis for supervised learning, or will they supplement other labeled data? The researcher could also use transfer learning, in which a CNN pre-trained on a large image dataset is fine-tuned for this specific task, potentially reducing the need for extensive labeled data.
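The clustering-then-interpretation idea can be sketched in a few lines. This is only an illustration under stated assumptions: the feature matrix is random synthetic data standing in for CNN embeddings, and the topic labels (`climate`, `labor`, `housing`) are hypothetical caption categories, not the researcher's data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in for CNN embeddings:
# 3 visual "styles", 40 images each, 16-dimensional feature vectors
centers = rng.normal(scale=5.0, size=(3, 16))
features = np.vstack([c + rng.normal(size=(40, 16)) for c in centers])

# Hypothetical caption topics, drawn independently of the visual styles
topics = rng.choice(["climate", "labor", "housing"], size=len(features))

# Unsupervised step: group images by visual similarity only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(features)

# Post hoc step: cross-tabulate visual clusters against caption topics
summary = pd.crosstab(
    pd.Series(clusters, name="cluster"),
    pd.Series(topics, name="caption_topic"),
)
print(summary)
```

In a real analysis, `features` would come from a pre-trained CNN rather than a random generator, and the number of clusters would need to be chosen and justified rather than fixed at 3.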


# Exercise 2

In [5]:
import pandas as pd

data = pd.read_csv('formative_data.csv')
data

Unnamed: 0,x1,x2,x3,x4,x5,outcome
0,-1.451602,0.753562,1,4.104114,16.338545,9.403554
1,0.708100,0.041478,0,-0.616641,-8.945764,-0.310894
2,1.593097,0.627988,1,2.107764,-19.439949,8.174084
3,-0.837596,0.038860,1,2.745286,8.180159,4.413289
4,1.929470,0.076839,1,3.530637,-23.366917,18.086722
...,...,...,...,...,...,...
995,-1.079707,0.798373,1,2.025600,12.545803,4.888038
996,0.550541,0.010368,1,1.944260,-5.510572,5.830833
997,1.975929,0.269473,1,3.770674,-24.944807,18.453913
998,-0.007371,0.023332,0,-2.273461,-0.882945,-1.507773
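The `Unnamed: 0` column in the table above looks like a row index that was written to the CSV when it was saved; that is an assumption, since the file itself is not shown. A minimal sketch of how `index_col=0` changes what pandas loads, using the first three displayed rows re-typed as inline CSV text:

```python
import io
import pandas as pd

# First rows of the displayed table, re-typed as CSV text.
# The empty leading header field is what pandas displays as "Unnamed: 0".
csv_text = """,x1,x2,x3,x4,x5,outcome
0,-1.451602,0.753562,1,4.104114,16.338545,9.403554
1,0.708100,0.041478,0,-0.616641,-8.945764,-0.310894
2,1.593097,0.627988,1,2.107764,-19.439949,8.174084
"""

# Default read: the saved index becomes an ordinary column
default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0: the first column is used as the row index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default.columns.tolist())
print(indexed.columns.tolist())
```

If the column is indeed just a row counter, reading the file this way keeps it out of any feature matrix built later with `data.drop(...)`.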


In [15]:
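# Split features and target, holding out 20% of the rows for testing
from sklearn.model_selection import train_test_split

# Note: X still includes 'Unnamed: 0', which appears to be a saved row index
X = data.drop(columns='outcome')
y = data['outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)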

In [19]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create a linear regression model
model = LinearRegression()

# Fit the model on the training split
model.fit(X_train, y_train)

# Predict outcomes for the held-out test split
y_pred = model.predict(X_test)

# Evaluate the predictions against the true test outcomes
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R^2: {r2}')

Mean Squared Error: 89.60973428956058
R^2: -0.03809981385944394
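A negative R^2 on held-out data means the predictions have a larger squared error than a constant prediction equal to the test-set mean. The sketch below illustrates this with a mean-only baseline; the data are synthetic random numbers, not `formative_data.csv`, so the printed value says nothing about the model above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)

# Synthetic targets only; the features are placeholders the baseline ignores
y_train = rng.normal(size=800)
y_test = rng.normal(size=200)
X_train = np.zeros((800, 1))
X_test = np.zeros((200, 1))

# Baseline that always predicts the training-set mean
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

# R^2 = 1 - SS_res / SS_tot, where SS_tot uses the test-set mean
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_test, y_pred)

print(f"manual R^2:  {r2_manual}")
print(f"sklearn R^2: {r2_sklearn}")
```

Because the training mean rarely equals the test mean exactly, this baseline scores at or slightly below zero, which gives a reference point for judging a fitted model's test R^2.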
