1) Read the following highly-cited article by Fearon and Laitin (2003): https://cisac.fsi.stanford.edu/publications/ethnicity_insurgency_and_civil_war  

a. Between 1945 and 1999, civil wars were a major source of violent death worldwide. There were 127 civil wars that each killed at least 1,000 people, 25 of which were still ongoing in 1999. These conflicts killed an estimated 16.2 million people in total.

This paper explores many questions surrounding those civil wars: 

What explains the recent prevalence of violent civil conflict around the world? 
Is it due to the end of the Cold War and associated changes in the international system, or is it the result of longer-term trends? 
Why have some countries had civil wars while others have not?   

The paper tests these questions empirically using data from that period. The authors conclude that the prevalence of civil war in the 1990s was not caused by the end of the Cold War and the associated changes in the international system, and that ethnically or religiously diverse countries are not more prone to civil war. They also state that the start of a civil war cannot be predicted.

b. There are 127 observations because the data recorded 127 conflicts. 

c. Independent: Prior War, Per Capita Income, Log(population), Log(% mountainous), Noncontiguous state, Oil exporter, New state, Instability, Democracy, Ethnic fractionalization

Religious/Ethnic Diversity, Political Democracy/Issues, Income Inequality

Dependent: Probability of: Civil War, "Ethnic" War, Civil War (Dummy Variable), Civil War (Plus Empires), Civil War (COW) 

(COW) = Correlates of War

d. The coefficients (the values that are not in parentheses) represent the relationship between the independent and dependent variables. If a coefficient is positive, the relationship is positive. If a coefficient is negative, the relationship is negative.
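
Since the paper's models are logit regressions, the sign rule above can also be read as an odds ratio: exponentiating a coefficient gives the factor by which the odds of the outcome change for a one-unit increase in the predictor. The sketch below illustrates this with hypothetical coefficient values chosen only for demonstration; they are not taken from the paper's tables, and `odds_ratio` is a helper name introduced here.

```python
import math

def odds_ratio(coefficient):
    """Factor by which the odds change for a one-unit increase in the predictor."""
    return math.exp(coefficient)

# Hypothetical coefficients for illustration only (not values from the paper).
examples = {"positive": 0.5, "zero": 0.0, "negative": -0.5}

for label, beta in examples.items():
    ratio = odds_ratio(beta)
    if ratio > 1:
        direction = "raises the odds"
    elif ratio < 1:
        direction = "lowers the odds"
    else:
        direction = "leaves the odds unchanged"
    print("{} coefficient {}: odds ratio {:.3f}, {}".format(label, beta, ratio, direction))
```

A positive coefficient always gives an odds ratio above 1 and a negative one gives a ratio below 1, which is why the sign alone is enough to state the direction of the relationship.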

e. The following independent variables have a positive relationship with each dependent variable: log(population), log(% mountainous), Oil Exporter, New State, Instability, Democracy, Religious fractionalization, Anocracy 

The following independent variables have a positive relationship with every dependent variable except for Civil War (COW). They have a negative relationship with Civil War (COW): Noncontiguous state, Ethnic Fractionalization

The following independent variables have a negative relationship with each dependent variable: Prior War, Per Capita Income.

f. Overall, the independent variable with the greatest range is New state.

2) Build a two-class logistic regression model from scratch. Load the Breast Cancer Wisconsin Dataset provided by sklearn:

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd

def generateXVector(x):
    vectorX = np.c_[np.ones((len(x), 1)), x]
    return vectorX

def theta_init(x):
    theta = np.random.randn(len(x[0])+1, 1)
    return theta

def sigmoid_f(x):
    return 1/(1+np.e**(-x))

def classifier_f(x, theta):
    h_theta = sigmoid_f(x.dot(theta))
    return h_theta

def gradient_f(x,y,theta):
    m = len(x)
    grad = 1/m * x.T.dot(sigmoid_f(x.dot(theta))-y)
    return grad

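def binary_loss_f(y,y_pred):
    # Clip predictions away from 0 and 1 so np.log never receives 0.
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    return np.dot(y.T, np.log(y_pred))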

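def logistic_regression(x, y, learningrate, iterations):
    # Reshape the labels into a column vector so they align with the predictions.
    y_new = np.reshape(y, (len(y), 1))
    # Add the intercept column and draw random starting parameters.
    vectorX = generateXVector(x)
    theta = theta_init(x)
    for i in range(iterations):
        # One gradient descent step on the cross-entropy loss.
        theta = theta - learningrate * gradient_f(vectorX, y_new, theta)
    # Evaluate the cross-entropy cost once, at the final parameters.
    y_pred = classifier_f(vectorX, theta)
    entropy_1 = binary_loss_f(y_new, y_pred)
    entropy_2 = binary_loss_f(1-y_new, 1-y_pred)
    cost_value = -np.sum(entropy_1 + entropy_2) / len(y_pred)
    return theta, cost_value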

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns = data.feature_names)
Y = data.target
scalar = MinMaxScaler()
x = scalar.fit_transform(X)
# The labels are already 0/1, so they are used as-is rather than rescaled.
y = Y
result = logistic_regression(x, y, 0.01, 10000)
result
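
The cell above returns the fitted parameters and the final cost, but not how well the model classifies. A minimal sketch of turning parameters into class labels and an accuracy score follows. The helper names `predict_labels` and `accuracy`, the 0.5 threshold, and the toy data and parameters are assumptions introduced here for illustration; they are not part of the original notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_labels(x, theta, threshold=0.5):
    """Prepend the intercept column, then threshold the predicted probabilities."""
    x_bias = np.c_[np.ones((len(x), 1)), x]
    probabilities = sigmoid(x_bias.dot(theta))
    return (probabilities >= threshold).astype(int).ravel()

def accuracy(y_true, y_pred):
    """Fraction of labels that match."""
    return float(np.mean(np.asarray(y_true).ravel() == np.asarray(y_pred).ravel()))

# Toy data and hypothetical parameters: intercept -3, slope 2,
# so the decision boundary sits at x = 1.5.
x_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
theta_toy = np.array([[-3.0], [2.0]])

print(accuracy(y_toy, predict_labels(x_toy, theta_toy)))
```

With the notebook's own variables, the same helpers could be applied as `theta, cost = result` followed by `accuracy(Y, predict_labels(x, theta))`. Note that this would be training accuracy, since the model is evaluated on the data it was fitted to.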

3) Implement the three following cross-validation algorithms from scratch: a. Leave-one-out cross-validation b. K-fold cross-validation c. Train-test split cross-validation

In [15]:
import random
import numpy as np
import sklearn.datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

random.seed(265)


def validate_leave_one_out(X, Y):
    lm = LinearRegression()
    ix_data  = list(range(X.shape[0]))
    X_index = np.array(ix_data)

    mse_list = []
    for i in ix_data:
        train_ix = np.delete(ix_data,i)
        test_ix = np.array([i])
        X_train, X_test = X[train_ix, :], X[test_ix, :]
        Y_train, Y_test = Y[train_ix], Y[test_ix]
        lm.fit(X_train, Y_train)
        Y_predict = lm.predict(X_test)
        mse = mean_squared_error(Y_test, Y_predict)
        mse_list.append(mse)
    print("LOOCV: {}".format(np.mean(mse_list)))
    return np.mean(mse_list)

# K fold
def k_fold_validation(X, Y, K=5):
    lm = LinearRegression()
    # Split the rows into K contiguous folds (no shuffling).
    X_K = np.array_split(X, K)
    Y_K = np.array_split(Y, K)
    mse_list = []
    for i in range(K):
        X_test = X_K[i]
        Y_test = Y_K[i]
        X_train = np.concatenate(X_K[:i] + X_K[i+1:])
        Y_train = np.concatenate(Y_K[:i] + Y_K[i+1:])
        lm.fit(X_train, Y_train)
        Y_predict = lm.predict(X_test)
        mse = mean_squared_error(Y_test, Y_predict)
        mse_list.append(mse)
    print("K-fold mse: {}".format(np.mean(mse_list)))
    return np.mean(mse_list)


def train_test_split_validation(X, Y, test_size=0.3, train_size=0.70):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, train_size=train_size, random_state=265)
    lm = LinearRegression()
    lm.fit(X_train, Y_train)
    Y_predict = lm.predict(X_test)
    mse = mean_squared_error(Y_test, Y_predict)
    print("train_test_split: {}".format(mse))
    return mse

def test():
    df = sklearn.datasets.fetch_california_housing(download_if_missing=True, return_X_y=False,
                                              as_frame=True)
    sc = MinMaxScaler()
    X = df.data.values.copy()
    X_scaled = sc.fit_transform(X)
    y = df.target.values
    # rescale the target to [0, 1]; this refits sc on y, after X_scaled is already computed
    Y_scaled = sc.fit_transform(y.reshape((y.shape[0], 1)))
    Y_scaled = Y_scaled.reshape((Y_scaled.shape[0],))
    k_fold_validation(X_scaled, Y_scaled, K=5)
    train_test_split_validation(X_scaled, Y_scaled, test_size=0.30, train_size=0.70)
    validate_leave_one_out(X_scaled, Y_scaled)
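
The task asks for a from-scratch implementation, while `train_test_split_validation` above relies on sklearn's `train_test_split`. A minimal sketch of the split itself, using only NumPy, could look like the following. The function name `train_test_split_scratch`, the `seed` parameter, and the demo arrays are assumptions introduced here; its shuffling will not reproduce the exact partition that sklearn produces with `random_state=265`.

```python
import numpy as np

def train_test_split_scratch(X, Y, test_size=0.3, seed=265):
    """Shuffle the row indices once, then cut them into a test part and a train part."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    permuted = rng.permutation(n)
    n_test = int(round(n * test_size))
    test_ix, train_ix = permuted[:n_test], permuted[n_test:]
    return X[train_ix], X[test_ix], Y[train_ix], Y[test_ix]

# Small demo arrays, for illustration only.
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.arange(10)
X_train, X_test, Y_train, Y_test = train_test_split_scratch(X_demo, Y_demo, test_size=0.3)
print(X_train.shape, X_test.shape)
```

The returned tuple has the same order as sklearn's, so it could be swapped into `train_test_split_validation` without changing the rest of that function.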



In [16]:
test()

K-fold mse: 0.023734108506413422
train_test_split: 0.023161942880990504
LOOCV: 0.02245687523556031


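Based on the results, the MSE ordering is LOOCV < train/test split < K-fold. LOOCV gives the smallest MSE here: each of its models is trained on all observations but one, and every observation is used exactly once as the test point. However, it is also the slowest method, because it fits one model per observation and the dataset is large.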
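
The runtime difference follows from the number of model fits: LOOCV is K-fold with K equal to the number of observations. The sketch below shows this on small synthetic data. It is an illustration under stated assumptions, not the notebook's method: it swaps `LinearRegression` for `np.linalg.lstsq`, and the names `fit_predict`, `k_fold_mse`, and the synthetic data are introduced here.

```python
import numpy as np

def fit_predict(X_train, Y_train, X_test):
    """Ordinary least squares with an intercept, solved with np.linalg.lstsq."""
    A = np.c_[np.ones(len(X_train)), X_train]
    coef, *_ = np.linalg.lstsq(A, Y_train, rcond=None)
    return np.c_[np.ones(len(X_test)), X_test].dot(coef)

def k_fold_mse(X, Y, K):
    """Return the mean test MSE over K contiguous folds and the number of fits."""
    all_ix = np.arange(len(X))
    folds = np.array_split(all_ix, K)
    errors = []
    for test_ix in folds:
        train_ix = np.setdiff1d(all_ix, test_ix)
        pred = fit_predict(X[train_ix], Y[train_ix], X[test_ix])
        errors.append(np.mean((Y[test_ix] - pred) ** 2))
    return float(np.mean(errors)), len(folds)

# Synthetic linear data with a little noise, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(40, 2))
Y_demo = 1.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=40)

mse_5, fits_5 = k_fold_mse(X_demo, Y_demo, K=5)
mse_loo, fits_loo = k_fold_mse(X_demo, Y_demo, K=len(X_demo))
print(fits_5, fits_loo)  # prints: 5 40
```

On 40 rows the gap is 5 fits against 40. On a dataset with tens of thousands of rows, LOOCV needs tens of thousands of fits, which is why it dominates the running time.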