## Lasso Regression Modelling

#### Using lasso regression, let's predict a citizen's perception of Chinese Influence 

Lasso regression gives us another model to compare against, to determine whether regression (treating the response variables as continuous even though they are ordinal) fits the data best.
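
As a minimal sketch of what treating an ordinal response as continuous looks like in practice, the snippet below fits `LassoCV` to synthetic data (not the survey data), rounds the continuous predictions back to whole-number categories, and scores them as class labels. The synthetic features, the bin edges, and the clipping step are assumptions made for illustration; the functions in this notebook round their predictions but do not clip them.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# synthetic features; only the first two actually drive the response
X = rng.normal(size=(n, 6))
latent = 1.2 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=n)

# cut the latent score into ordinal categories 1..5 (bin edges are arbitrary)
y = np.digitize(latent, bins=[-1.5, -0.5, 0.5, 1.5]) + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit on the ordinal codes as if they were continuous values
reg = LassoCV(cv=5, random_state=0).fit(X_train, y_train)

# round to the nearest category and clip to the valid range (clipping is an assumption)
pred = np.clip(np.around(reg.predict(X_test)), y.min(), y.max()).astype(int)

acc = accuracy_score(y_test, pred)
print(f"accuracy after rounding: {acc:.3f}")
```

Rounding is what turns the regression output back into something comparable with a classifier's accuracy.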

The following modelling allows the class labels to be remapped, supports categorical or binary responses, and performs feature selection. This is an experimental model, and its results should not be treated as definitive.
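
For reference, the objective that scikit-learn's lasso minimizes can be written as below (intercept omitted for brevity). Here $n$ is the number of samples, $X$ the feature matrix, $y$ the response, $w$ the coefficient vector, and $\alpha$ the shrinkage hyperparameter that `LassoCV` chooses by cross-validation. The $\ell_1$ penalty is what pushes some coefficients to exactly zero, which is the sense in which the model performs feature selection.

```latex
\hat{w} = \arg\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```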

In [1]:
import numpy as np
# set a seed for reproducibility
np.random.seed(23)

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

# to scale our data
from sklearn.preprocessing import MinMaxScaler

# to generate k-folds from the data
from sklearn.model_selection import KFold, train_test_split

# use sklearn.metrics to calculate a confusion matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# to calculate distance matrix
from sklearn.metrics import pairwise_distances

# to use in building graphs for DT
import graphviz 

# to use in viewing feature importance from random forest 
import seaborn as sns

# to export dataframes
import dataframe_image as dfi

# to build out Lasso regression 
from sklearn.linear_model import LassoCV

In [7]:
!pip install sklearn

Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Building wheels for collected packages: sklearn
  Building wheel for sklearn (setup.py) ... done
  Created wheel for sklearn: filename=sklearn-0.0-py2.py3-none-any.whl size=1315 sha256=90ac19b34e26070fa6f8c82c861a6b6c734c57eec1f206ac7c7271230b2c64ea
  Stored in directory: /Users/natalie_kraft/Library/Caches/pip/wheels/22/0b/40/fd3f795caaa1fb4c6cb738bc1f56100be1e57da95849bfc897
Successfully built sklearn
Installing collected packages: sklearn
Successfully installed sklearn-0.0


In [2]:
# import data
model = pd.read_csv(r'/Users/natalie_kraft/Documents/LAS/NIGr6_trans_dt_reduced.csv')

In [3]:
model = model.drop(columns=['Unnamed: 0'])

## The following functions are defined as... 

> discretize (mapping, labelName, df): turns the categorical/ordinal response into binary or other discretized value   

> dtree(x_train, y_train, x_test, criterion, random_state, depth, prob): creates a decision tree using the training data set and then predicts the labels for the test dataset
- users can specify tree depth as well as use of gini/entropy 
- return the tree and the prediction accuracy

> randomforest(x_train, y_train, x_test, criterion, random_state, depth, prob): builds 100 decision trees to determine the relevance of attributes based on different splits
- returns all of the trees and the prediction accuracies 

> featureImportance(x_train, y_train, x_test, criterion, random_state, depth, prob, runningModel, threshold): builds a random forest, calculates feature importance from the forest, determines which attributes are relevant to the model, and builds a reduced-attribute model
- returns the model with reduced attributes, based on the threshold set

> evaluation_measures(y_true, y_pred): computes the accuracy and the confusion matrix
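
To make the accuracy calculation in `evaluation_measures` concrete, here is a small worked example with three class labels. The label values are made up for illustration; the point is that accuracy is the sum of the confusion matrix diagonal divided by the sum of all its entries, which works for any number of classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# made-up labels with three classes
y_true = [1, 2, 3, 3, 2, 1, 3]
y_pred = [1, 2, 3, 2, 2, 3, 3]

cm = confusion_matrix(y_true, y_pred)

# correct predictions sit on the diagonal, so accuracy = trace / total
accuracy = np.trace(cm) / cm.sum()

print(cm)
print(accuracy)  # 5 of the 7 predictions match, so 5/7
```

The result agrees with `sklearn.metrics.accuracy_score`, which is already imported at the top of this notebook.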

In [25]:
def discretize( mapping, labelName, df ):

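    # remap the raw response codes (e.g. to binary or consolidated classes);
    # codes missing from the mapping become NaN and are imputed below
    if mapping is not None:
        df[labelName] = df[labelName].map(mapping)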
    
    # drop tribe_names first as it has string data (can't be imputed)
    if labelName == "tribe_name":
        labels = df['tribe_name']
    
    df = df.drop(["tribe_name"], axis=1)
        
    # drop columns that contain only missing values
    # (compare against the row count rather than a hard-coded 2400)
    todrop = [col for col in df.columns if df[col].isna().sum() == df.shape[0]]

    df = df.drop(columns=todrop)
    
    # fill in missing values 
    imputer = KNNImputer(n_neighbors=3, weights="uniform")
    t = imputer.fit_transform(df)

    df = pd.DataFrame(t, columns=df.columns)
    
    # extract the (imputed) labels; the caller rounds them to whole numbers
    if labelName != "tribe_name":
        labels = df[labelName]

    # drop the response variables and region so they are not used as features
    response_cols = ['econInfluence', 'posnegInfluence', 'posImageChina', 'negImageChina', 
                     'econAssistance', 'region']
    df = df.drop(response_cols, axis=1)
 
    return df, labels



In [26]:
def evaluation_measures(y_true, y_pred):

    # binary equivalent for accuracy:
    # cm = confusion_matrix(y_true, y_pred)
    # accuracy = (cm[0][0] + cm[1][1]) / (cm[0][0] + cm[1][1] + cm[0][1] + cm[1][0])
    
    cm = confusion_matrix(y_true, y_pred)
    # accuracy = (tp + tn) / (tp + tn + fp + fn), generalized here as the sum of
    # the diagonal over all entries, in anticipation of more than two class labels
    accuracy = sum(cm[i][i] for i in range(cm.shape[0])) / cm.sum()
    
    # return as a list
    return [accuracy, cm]

In [33]:
def feature_importance(reg, accuracy, modelColumns):
    
    features = []
    removedFeatures = []
    # Which attributes are most predictive of the outcome variable?
    for i in range(0, len(reg.coef_)):
        if reg.coef_[i] != 0:
            features.append([modelColumns[i], reg.coef_[i]])
        else:
            removedFeatures.append(modelColumns[i])
            
   
    print(f'Model coefficients:\n{features}')
    
    print()
    print(f'Model features removed include:\n {removedFeatures}')

    print()
    # Note: we called this 'lambda' in class, but sklearn calls it alpha
    print(f'The shrinkage coefficient hyperparameter chosen by CV: {reg.alpha_}')

In [28]:
def regression_lasso(X_train, X_test, y_train, y_test, i, random_state, printC, modelColumns):
    
    reg = LassoCV(cv=i, random_state=random_state).fit(X_train, y_train)
    pred = reg.predict(X_test)
    
    pred = np.around(pred)
    accuracy, cm = evaluation_measures(y_test, pred)
    
    if printC is True:
        feature_importance(reg, accuracy, modelColumns)
        
    return [accuracy, cm, pred]

In [29]:
def regression_lasso_cross(label, X_train, X_test, y_train, y_test, random_state):
    
    scatter = []
    highestVal = 0
    bestCross = -1
    
    for i in range(2, 20):
        
        evaluations = regression_lasso(X_train, X_test, y_train, y_test, i, random_state, False, None)
        
        scatter.append([label, i, evaluations[0]])
        if evaluations[0] > highestVal: 
            highestVal = evaluations[0]
            bestCross = i
        
    return scatter, highestVal, bestCross
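
For context on what the `cv` argument varied above changes inside `LassoCV`, the sketch below uses synthetic regression data (an assumption, not the survey data) to show that the chosen `alpha_` is the candidate with the lowest cross-validated mean squared error, averaged over folds. Note that this is the internal selection of alpha; `regression_lasso_cross` separately picks the fold count by accuracy on the test set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# synthetic data: 10 features, only 3 of which are informative
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

for n_folds in (3, 5, 10):
    reg = LassoCV(cv=n_folds, random_state=0).fit(X, y)

    # mse_path_ has shape (n_alphas, n_folds); average over the folds
    mean_mse = reg.mse_path_.mean(axis=1)
    best = reg.alphas_[np.argmin(mean_mse)]

    n_kept = int(np.sum(reg.coef_ != 0))
    print(f"folds={n_folds}  alpha_={reg.alpha_:.4f}  argmin alpha={best:.4f}  kept={n_kept}")
```

Different fold counts can lead to a different `alpha_`, and therefore to a different set of non-zero coefficients.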

## Predict your model

#### What do you want to predict? Choose your variable first:
- q81a - econInfluence   - China's economic quantity of influence 
- q81b - posnegInfluence - China's positive or negative influence 
- q81c - posImageChina   - Positive quality of Chinese influence 
- q81d - negImageChina   - Negative quality of Chinese influence 
- q81e - econAssistance  - China's economic assistance meets country needs 

In [30]:
# choose your variable here 
predictor = "econAssistance"

# testing dataset size
size = 0.25

In [31]:
# listing of all of the mappings to remove invalid outputs 

# econInfluence
key_q81a_all =  {
    9:-1, # don't know
    0:1, # None
    1:2, # A little 
    2:3, # some
    3:4, # a lot 
}

key_q81a =  {
    0:0, # None
    1:1, # A little 
    2:1, # some
    3:1, # a lot 
}

# posnegInfluence
key_q81b_all =  {
    1:1, # very negative
    2:2, # somewhat negative
    3:3, # neutral 
    4:4, # somewhat positive
    5:5, # very positive
    9:3, # neutral 
}

key_q81b =  {
    1:1, # very negative
    2:1, # somewhat negative
    3:2, # neutral 
    4:3, # somewhat positive
    5:3, # very positive
    9:2, # neutral 
}

# posImageChina
key_q81c_all = {
    1:1, # support for international affairs 
    2:2, # non interference in African affairs 
    3:3, # investment in infrastructure 
    4:4, # business investment 
    5:5, # cost of Chinese goods 
    6:6  # soft power
}

key_q81c = {
    1:1, # support for international affairs 
    2:2, # non interference in African affairs 
    3:3, # investment in infrastructure 
    4:3, # business investment 
    5:4, # cost of Chinese goods 
    6:5  # soft power
}

# negImageChina
key_q81d_all = {
    1:1, # extracting resources
    2:2, # land grabbing
    3:3, # corruption
    4:4, # taking jobs from locals 
    5:5, # quality of Chinese goods 
    6:6  # soft power
}

key_q81d = {
    1:1, # extracting resources
    2:1, # land grabbing
    3:2, # corruption
    4:3, # taking jobs from locals 
    5:3, # quality of Chinese goods 
    6:4  # soft power
}

# econAssistance
key_q81e_all =  {
    1:1, # very bad
    2:2, # somewhat bad
    3:3, # neutral 
    4:4, # somewhat good
    5:5, # very good
    9:3, # don't know
}

key_q81e =  {
    1:1, # very bad
    2:1, # somewhat bad
    3:2, # neutral 
    4:3, # somewhat good
    5:3, # very good
    9:2, # don't know
}

allKeys = [key_q81a_all, key_q81a, key_q81b_all, key_q81b, key_q81c_all, key_q81c, 
           key_q81d_all, key_q81d, key_q81e_all, key_q81e]

allVars = ['econInfluence', 'econInfluence', 'posnegInfluence', 'posnegInfluence',
           'posImageChina', 'posImageChina', 'negImageChina', 'negImageChina', 'econAssistance', 'econAssistance']
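
The effect of a mapping that leaves a code out can be sketched on a toy frame. The snippet below mirrors `key_q81a`, which has no entry for code 9: `Series.map` turns the unmapped code into `NaN`, and a `KNNImputer` (as used in `discretize`) then fills it from the nearest rows. The toy values and the single `age` feature are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# toy data: one feature and one raw response column (values are made up)
toy = pd.DataFrame({
    "age": [20.0, 21.0, 22.0, 60.0],
    "econInfluence": [1, 2, 9, 0],
})

# same shape as key_q81a: code 9 ("don't know") has no entry
binary_key = {0: 0, 1: 1, 2: 1, 3: 1}

toy["econInfluence"] = toy["econInfluence"].map(binary_key)
mapped = toy["econInfluence"].copy()
print(mapped.tolist())  # the unmapped code 9 is now NaN

# fill the gap from the two rows closest in age
imputer = KNNImputer(n_neighbors=2, weights="uniform")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled["econInfluence"].tolist())
```

Because the imputed value is an average of neighbouring labels, it is not guaranteed to be a whole number in general, which is why the labels are rounded with `np.around` before modelling.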

### Run model with each response variable 

Determine the accuracy for each response variable, using the number of cross-validation folds that gives the model its best accuracy.

__Average results for full and reduced responses__

To show the benefit of consolidating the categorical/ordinal responses into larger groups, the average accuracy across all response variables is reported for both the full and the consolidated mappings.

In [34]:
modelStatus = []

for idx, key in enumerate(allKeys): 
    
    print("\nLet's work on " + allVars[idx] + "\n")
    
    runningModel, labels = discretize (key, allVars[idx], model.copy())
    
    labels = np.around(labels)
        
    print(labels.value_counts())

    # split the data into training and test sets
    # (random split; no stratify argument is passed, so it is not stratified)
    xtrain, xtest, ytrain, ytest = train_test_split(runningModel, 
                                                    labels, 
                                                    test_size = size)

    
    scatter, highestVal, bestCross = regression_lasso_cross(key, xtrain, xtest, ytrain, ytest, 33)
    
    modelStatus.append([allVars[idx], highestVal, bestCross])
    
    print("Valuable features for " + allVars[idx] + " include: \n")
    regression_lasso(xtrain, xtest, ytrain, ytest, bestCross, 33, True, runningModel.columns )
    
sumFull = 0
sumPart = 0

for i in range(0, len(modelStatus)): 
    if i % 2 == 0: 
        # if even, it is the full thing (less accurate)
        sumFull = sumFull + modelStatus[i][1]
    else: 
        # if odd, it is consolidated version 
        sumPart += modelStatus[i][1]

print("Full model average : " + str(sumFull / 5))
print("Consolidated model average : " + str(sumPart / 5))

modelStatus


Let's work on econInfluence

 4.0    806
 3.0    776
 2.0    429
-1.0    305
 1.0     84
Name: econInfluence, dtype: int64
Valuable features for econInfluence include: 

Model coefficients:
[['urban', 0.15456230651372263], ['cellService', 0.17628427906734234], ['roadblocksCom', 0.1604155485895], ['age', 0.0036375412331899914], ['livingConditionOthers', -0.03419967128882476], ['prevEconConditions', -0.07709458053005282], ['oftenWOWater', 0.029154363726672267], ['oftenWOFuel', 0.026209054953383385], ['remittance', 0.04729436294884446], ['fearedCrime', 0.01673487356631471], ['radioNews', 0.06326847309249557], ['tvNews', 0.09833946708894578], ['newspaperNews', 0.021294233804961114], ['internetNews', 0.023505642686649363], ['socialMediaNews', 0.0394928663326519], ['interestPA', 0.0009733928620541748], ['discussPolitics', 0.02226456141962671], ['freePolitics', 0.015744996035034042], ['freeVote', 0.034736858890806514], ['govBanOrgs', -0.07394405724436307], ['govControlMedia', 0.0032621948242

4.0    1030
3.0     639
5.0     567
2.0     120
1.0      44
Name: posnegInfluence, dtype: int64
Valuable features for posnegInfluence include: 

Model coefficients:
[['urban', 0.03204062026288049], ['cellService', 0.15503930782392847], ['policeStation', -0.0025126300030322204], ['healthClinic', 0.052883802615355975], ['bank', -0.06251815559148145], ['roadblocksArmy', 0.018685237668397785], ['pavedRoad', -0.008058938269764471], ['age', -0.0012473079499995383], ['countryEconCondition', -0.01935625273361782], ['livingCondition', 0.005807702283144825], ['livingConditionOthers', -0.046182686238336525], ['prevEconConditions', -0.0213244131395899], ['futureEconConditions', 0.01261671162498895], ['oftenWOFood', 0.007939296754684399], ['oftenWOWater', -0.047446271905114246], ['oftenWOFuel', 0.03289890393378414], ['oftenWOCash', -0.00558714565759665], ['remittance', 0.014984931093481584], ['unsafeNeighborhood', -0.02377503854907279], ['fearedCrime', 0.015962837442874727], ['beenRobbed', 0.013317

4.0    820
5.0    571
3.0    555
1.0    250
2.0    170
6.0     34
Name: posImageChina, dtype: int64
Valuable features for posImageChina include: 

Model coefficients:
[['pipedWater', -0.01132877699469919], ['roadblocksCom', -0.1338867264616438], ['pavedRoad', 0.07886608768481813], ['roadImpass', -0.010688672424430935], ['age', -0.0018215135199870264], ['countryEconCondition', -0.031091607072460177], ['livingConditionOthers', 0.001230146589229459], ['futureEconConditions', 0.021607860022868047], ['oftenWOWater', 0.012756193785133417], ['oftenWOMed', -0.019759290738152465], ['unsafeNeighborhood', -0.02361201234771722], ['fearedCrime', -0.009633376615609429], ['radioNews', 0.00198704822478499], ['newspaperNews', -0.01290031743372659], ['internetNews', -0.0012357282430056557], ['socialMediaNews', -0.011454458865049269], ['interestPA', 0.0009585251572896917], ['freeSpeech', 0.005619295476757012], ['govControlMedia', -0.014742157174412034], ['involveReligous', -0.07544593101723214], ['involv

5.0    1158
4.0     478
3.0     267
1.0     239
2.0     136
6.0     122
Name: negImageChina, dtype: int64
Valuable features for negImageChina include: 

Model coefficients:
[['urban', 0.03147608245869668], ['solidiersSeen', 0.16444265444906264], ['roadblocksCom', -0.19377924286754986], ['roadImpass', -0.06575092908351186], ['age', 0.004579523954035672], ['rDirection', 0.025410907212867765], ['livingCondition', -0.066989543608724], ['futureEconConditions', 0.0441229277902404], ['oftenWOWater', -0.01409131077136711], ['oftenWOMed', -0.016292966621180564], ['oftenWOFuel', 0.0423601731657848], ['oftenWOCash', 0.038643288474824934], ['remittance', -0.05690092112031845], ['unsafeNeighborhood', 0.00493644563645455], ['tvNews', 0.004942718654225469], ['newspaperNews', -0.0442910409930033], ['socialMediaNews', -0.00507525171515394], ['discussPolitics', -0.037605684743030526], ['freeSpeech', 0.07924855905611132], ['freePolitics', 0.07186885622050536], ['freeVote', -0.0607724557583645], ['govBanO

4.0    859
3.0    683
5.0    528
2.0    253
1.0     77
Name: econAssistance, dtype: int64
Valuable features for econAssistance include: 

Model coefficients:
[['urban', 0.04927690652576329], ['policeStation', -0.055186761521790896], ['roadblocksArmy', 0.02601672047269318], ['age', -0.0005361763079712855], ['rDirection', -0.04748184200868559], ['livingCondition', -0.02055378409422732], ['livingConditionOthers', -0.021027605392430933], ['prevEconConditions', -0.05432853884965245], ['futureEconConditions', 0.01640385243126755], ['oftenWOMed', -0.023112080518031465], ['oftenWOCash', -0.018037500995974547], ['remittance', 0.028169672478951083], ['fearedCrime', -0.008645396688365371], ['newspaperNews', 0.03813151874917287], ['internetNews', 0.0038523651670651295], ['freeVote', 0.012662291393491672], ['govBanOrgs', -0.03564623619587587], ['involveCommunityGroup', -0.00046097634047427197], ['freqCommunityMeetings', -0.04751253092378118], ['freqRaiseIssue', 0.02596429781801621], ['voted', 0.082

[['econInfluence', 0.28833333333333333, 2],
 ['econInfluence', 0.96, 2],
 ['posnegInfluence', 0.46, 15],
 ['posnegInfluence', 0.6316666666666667, 2],
 ['posImageChina', 0.3466666666666667, 4],
 ['posImageChina', 0.59, 2],
 ['negImageChina', 0.30666666666666664, 9],
 ['negImageChina', 0.53, 4],
 ['econAssistance', 0.37333333333333335, 7],
 ['econAssistance', 0.48833333333333334, 8]]
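
The two averages can be recomputed directly from the `modelStatus` values listed above. This is a small sketch using pandas; the column names and the `mapping` label are hypothetical names added for readability, and the alternating full/consolidated order follows the order of `allKeys`.

```python
import pandas as pd

# values copied from the modelStatus output above
model_status = [
    ['econInfluence', 0.28833333333333333, 2],
    ['econInfluence', 0.96, 2],
    ['posnegInfluence', 0.46, 15],
    ['posnegInfluence', 0.6316666666666667, 2],
    ['posImageChina', 0.3466666666666667, 4],
    ['posImageChina', 0.59, 2],
    ['negImageChina', 0.30666666666666664, 9],
    ['negImageChina', 0.53, 4],
    ['econAssistance', 0.37333333333333335, 7],
    ['econAssistance', 0.48833333333333334, 8],
]

results = pd.DataFrame(model_status, columns=["response", "accuracy", "best_cv"])

# even rows use the full mapping, odd rows the consolidated one
results["mapping"] = ["full", "consolidated"] * 5

averages = results.groupby("mapping")["accuracy"].mean()
print(averages)  # full is about 0.355, consolidated about 0.64
```

Grouping by a label avoids the hard-coded division by 5 and keeps working if response variables are added or removed.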