## Imputation

### Step 1. Load the data

In [137]:
import csv 
import numpy as np

with open('covid_randomrowsa.csv', 'rt') as file:
    reader = csv.reader(file)
    data = list(reader)

# the first row is the feature names
print("There are in total of ", len(data) - 1, " tuples.")

There are in total of  4600  tuples.


### Step 2. Data Investigation

A few things to know before proceeding. The questionnaire contains a series of questions grouped into categories. The table below lists each category, the number of questions in it, and the corresponding column indices in the data file.

| Category Name | Number of Questions | Column Indices |
|---------------|---------------------|----------------|
|Physical Contact| 5 | 7 - 11 |
|Physical Hygiene| 5 | 12 - 16 |
|Anti-corona Policy Support| 5 | 17 - 21 | 
|Generosity| 4 | 22 - 25 |
|Psychological Well-being| 2 | 26 - 27 |
|Collective Narcissism| 3 | 28 - 30 |
|National Identification| 2 | 31 - 32 |
|Conspiracy Theories| 4 | 33 - 36 | 
|Open-mindedness| 6 | 37 - 42 | 
|Morality as Cooperation| 7 | 43 - 49 | 
|Trait Optimism| 2 | 50 - 51 | 
|Social Belonging| 4 | 52 - 55 | 
|Trait Self-control| 4 | 56 - 59 | 
|Self-esteem| 1 | 60 |
|Narcissism| 6 | 61 - 66 | 
|Moral Identity| 10 | 67 - 76 |
|Risk Perception| 2 | 77 - 78 |
|Political Ideology| 1 | 79 | 
|Moral Circle| 1 | 80 | 
|Physical Health| 1 | 81 | 
|Cognitive Reflection Test| 3 | 82 - 84 |
|Sex| 2 | 85 - 86 | 
|Age| 1 | 87 |
|Marital Status| 1 | 88 |
|Number of Children| 1 | 89 | 
|Employment Status| 2 | 90 - 91 |
|Ladder| 1 | 92 | 
|Urban?| 1 | 93 |
|Tested Positive| 1 | 94 |
|Known Tested Positive| 1 | 95 |
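
As a quick sanity check on the table, the column ranges can be written down as a dictionary and verified to be gap-free. This is only a sketch: the name `CATEGORY_COLUMNS` and the helper variables are mine and do not appear elsewhere in this notebook.

```python
# Inclusive (first, last) column indices per category, copied from the table above.
# CATEGORY_COLUMNS is a hypothetical name used only for this check.
CATEGORY_COLUMNS = {
    "Physical Contact": (7, 11),
    "Physical Hygiene": (12, 16),
    "Anti-corona Policy Support": (17, 21),
    "Generosity": (22, 25),
    "Psychological Well-being": (26, 27),
    "Collective Narcissism": (28, 30),
    "National Identification": (31, 32),
    "Conspiracy Theories": (33, 36),
    "Open-mindedness": (37, 42),
    "Morality as Cooperation": (43, 49),
    "Trait Optimism": (50, 51),
    "Social Belonging": (52, 55),
    "Trait Self-control": (56, 59),
    "Self-esteem": (60, 60),
    "Narcissism": (61, 66),
    "Moral Identity": (67, 76),
    "Risk Perception": (77, 78),
    "Political Ideology": (79, 79),
    "Moral Circle": (80, 80),
    "Physical Health": (81, 81),
    "Cognitive Reflection Test": (82, 84),
    "Sex": (85, 86),
    "Age": (87, 87),
    "Marital Status": (88, 88),
    "Number of Children": (89, 89),
    "Employment Status": (90, 91),
    "Ladder": (92, 92),
    "Urban?": (93, 93),
    "Tested Positive": (94, 94),
    "Known Tested Positive": (95, 95),
}

# Number of columns covered by each category (ranges are inclusive).
counts = {name: last - first + 1 for name, (first, last) in CATEGORY_COLUMNS.items()}

# Each range should start right after the previous one ends.
spans = list(CATEGORY_COLUMNS.values())
contiguous = all(b[0] == a[1] + 1 for a, b in zip(spans, spans[1:]))

print(sum(counts.values()), contiguous)
```

The ranges cover columns 7 through 95 without gaps, so columns 0 to 6 are the only ones outside the questionnaire categories.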


Notice:
- We will drop the second sex column (86), which holds the free-text "other" input, since the main sex column (85) already contains the "other" option.
- We will drop the second employment status column (91) for the same reason.
- We will drop the Urban column (93), since the questionnaire contains no such question at all.

### Step 3: Imputation

Each `NA` (or empty) answer is replaced by the mean of the answered questions in the same category for that tuple. If no question in the category has an answer (for example, a single-question category whose answer is `NA`), there is nothing to average, so the tuple is dropped.
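
To make the rule concrete, here is a minimal sketch on a toy row. The helper `impute_span` and the constant `MISSING` are illustrative names of my own, not the functions used in the next cell, and the values are made up.

```python
MISSING = {"NA", ""}  # markers treated as "no answer"

def impute_span(row, first, last):
    """Fill missing answers in row[first..last] (inclusive) with the mean
    of the answered ones. Returns False if nothing in the span was answered."""
    answered = [float(v) for v in row[first:last + 1] if v not in MISSING]
    if not answered:
        return False  # nothing to average: the caller should drop this tuple
    mean = sum(answered) / len(answered)
    for j in range(first, last + 1):
        row[j] = mean if row[j] in MISSING else float(row[j])
    return True

# A four-question category with two missing answers.
row = ["3", "NA", "5", ""]
ok = impute_span(row, 0, 3)
print(ok, row)

# A single-question category with a missing answer cannot be imputed.
single = ["NA"]
dropped = not impute_span(single, 0, 0)
print(dropped)
```

Filling a gap with the category mean leaves that mean unchanged, so computing it once per category gives the same values as recomputing it after every fill.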

In [138]:
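def average(piece, s, e):
    """Mean of the answered values in piece[s:e]; returns -1 if none is answered."""
    total = 0
    number = 0

    for i in range(s, e):
        # skip missing answers
        if piece[i] == 'NA' or piece[i] == '':
            continue
        total += float(piece[i])
        number += 1
    if number == 0:
        # sentinel: every question in this range is unanswered
        return -1
    return total / number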

def process(data):
    # set of indexes of tuples to be deleted
    to_delete = set()

    # inclusive [first, last] column ranges, one per question category;
    # columns 86, 91 and 93 are left out because they are dropped later
    categories = [[7, 11], [12, 16], [17, 21], [22, 25], [26, 27], [28, 30],
                  [31, 32], [33, 36], [37, 42], [43, 49], [50, 51], [52, 55],
                  [56, 59], [60, 60], [61, 66], [67, 76], [77, 78], [79, 79],
                  [80, 80], [81, 81], [82, 84], [85, 85], [87, 87], [88, 88],
                  [89, 89], [90, 90], [92, 92], [94, 94], [95, 95]]

    for i in range(1, len(data)):
        # processing one tuple (row 0 holds the feature names)
        for pair in categories:
            for j in range(pair[0], pair[1] + 1):
                if data[i][j] == 'NA' or data[i][j] == '':
                    mean = average(data[i], pair[0], pair[1] + 1)
                    if mean == -1:
                        # no answered question in this category: drop the tuple
                        to_delete.add(i)
                        break
                    data[i][j] = mean
                else:
                    data[i][j] = float(data[i][j])

    return to_delete


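# imputation and record the number of tuples to be deleted
# (copy every row: process() modifies rows in place, and data.copy()
# alone would still share the row lists with `data`)
imputation_data = [row.copy() for row in data]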
to_delete = process(imputation_data)
print("There are ", len(to_delete), " out of ", len(data) - 1, " pieces of data to be deleted.")

There are  778  out of  4600  pieces of data to be deleted.
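
Because `process` modifies rows in place, it matters whether the list passed to it shares its rows with `data`. The following small check, on made-up toy rows, shows how `list.copy()` and `copy.deepcopy` differ on a list of lists.

```python
import copy

# Toy stand-in for the loaded CSV: a header row plus one data row.
original = [["id", "q1"], ["1", "NA"]]

# list.copy() copies only the outer list; the inner row lists are shared.
shallow = original.copy()
shallow[1][1] = 2.5
shallow_changed_original = original[1][1] == 2.5

# copy.deepcopy() also copies the rows, so the source stays untouched.
original2 = [["id", "q1"], ["1", "NA"]]
deep = copy.deepcopy(original2)
deep[1][1] = 2.5
deep_changed_original = original2[1][1] == 2.5

print(shallow_changed_original, deep_changed_original)
```

For rows that contain only strings and numbers, `[row.copy() for row in data]` isolates the rows just as well as `copy.deepcopy(data)`.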


### Step 4. Trim off unnecessary columns and data cleaning
Some columns are deleted because they contribute nothing to the analysis, such as the record id.

In [139]:
data_to_use = []
for i in range(0, len(imputation_data)):
    if i in to_delete:
        continue
    else:
        data_to_use.append(imputation_data[i])

print(len(data_to_use))
print(len(data_to_use[0]))

3823
96


In [140]:
to_delete_column = [0, 1, 2, 3, 4, 5, 6, 86, 91, 93]
data_final = []
for piece in data_to_use:
    new_piece = []
    for i in range(0, len(piece)):
        if i in to_delete_column:
            continue
        else:
            new_piece.append(piece[i])
    data_final.append(new_piece)
print(len(data_final))
print(len(data_final[0]))

3823
86
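
The column trimming above can also be written with a set and a list comprehension. The sketch below uses the same indices on a dummy row whose values equal their own column index, which makes it easy to see which columns survive; `keep_columns` and `DROP` are illustrative names, not part of the notebook.

```python
# Same indices as to_delete_column; a set gives fast membership tests.
DROP = {0, 1, 2, 3, 4, 5, 6, 86, 91, 93}

def keep_columns(row, drop):
    """Return the row without the columns whose index is in `drop`."""
    return [value for index, value in enumerate(row) if index not in drop]

# Dummy 96-column row: each value is its own column index.
row = list(range(96))
trimmed = keep_columns(row, DROP)

print(len(trimmed), trimmed[0], trimmed[-2:])
```

After trimming, the first remaining column is 7 and the last two are 94 and 95, the two target columns.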


## Models

### Model 1: SVM

In [141]:
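# skip the header row; the last two columns are the two targets
data_data = [d[:-2] for d in data_final[1:]]
target_positive = [d[-2] for d in data_final[1:]]        # Tested Positive
target_known_positive = [d[-1] for d in data_final[1:]]  # Known Tested Positive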

In [142]:
from sklearn.model_selection import train_test_split

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(data_data, target_positive, test_size=0.2, random_state=109)

In [86]:
from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [87]:
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9686274509803922
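
An accuracy figure is easier to interpret next to the score of always predicting the most frequent class. The class balance of `target_positive` is not shown in this notebook, so the sketch below uses made-up labels purely to illustrate the calculation; `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(y_true)
    return counts.most_common(1)[0][1] / len(y_true)

# Hypothetical labels, NOT the survey data: 95 negatives and 5 positives.
y_example = [0.0] * 95 + [1.0] * 5
baseline = majority_baseline_accuracy(y_example)
print(baseline)
```

Calling `majority_baseline_accuracy(y_test)` would give the corresponding reference value for the split used here.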


In [114]:
%pip install tabulate

Note: you may need to restart the kernel to use updated packages.


In [129]:
from tabulate import tabulate

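def print_importances_table(coef, names):
    # use the coefficients passed in, not the global clf,
    # so the function works for any fitted linear model
    yx = list(zip(coef[0], names))
    yx.sort()
    yx.reverse()
    print(tabulate(yx, headers=["Coef", "Name"]))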
    
names = data_final[0][:-2]
print_importances_table(clf.coef_, names)

        Coef  Name
------------  ----------------------
 0.018489     physical_contact__5
 0.0184068    social_belonging__4
 0.0128968    Moral_ID__7
 0.0127846    Conspiracy_theories__4
 0.0119944    health_cond
 0.0119354    physical_hygiene__2
 0.0104806    Narcissism__2
 0.00971893   Narcissism__4
 0.00963583   morality_as_cooperat_5
 0.00888014   trait_optimism_1
 0.00859632   Narcissism__3
 0.00817583   physical_hygiene__5
 0.00807428   political_ideology
 0.00778134   policy_support__1
 0.00734784   social_belonging__1
 0.00734322   SUM_GEN
 0.00723017   collective_narcis__3
 0.00694238   trait_self.control__3
 0.00683051   Moral_ID__1
 0.00652065   physical_contact__2
 0.00543244   national_identity__1
 0.00534509   policy_support__3
 0.00505852   Moral_ID__5
 0.00435093   physical_hygiene__3
 0.00434038   Moral_ID__8
 0.00430237   open_mindedness__3
 0.00374671   morality_as_cooperat_7
 0.00351578   morality_as_cooperat_2
 0.00333359   Moral_ID__9
 0.00262206   generosity__3
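
The table above is sorted by signed coefficient, so features with large negative weights end up at the bottom. If the aim is to see which features weigh most in either direction, sorting by absolute value is an option. This is a sketch with made-up coefficients and feature names; note also that coefficient sizes depend on the scale of each feature.

```python
def rank_by_magnitude(coefs, names):
    """Pair coefficients with names, largest absolute value first."""
    return sorted(zip(coefs, names), key=lambda pair: abs(pair[0]), reverse=True)

# Hypothetical coefficients and feature names, for illustration only.
coefs = [0.018, -0.030, 0.002]
names_example = ["feature_a", "feature_b", "feature_c"]

ranked = rank_by_magnitude(coefs, names_example)
for coef, name in ranked:
    print(f"{coef:+.3f}  {name}")
```

With a fitted linear model this would be called as `rank_by_magnitude(clf.coef_[0], names)`.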
 

### Model 2: Logistic Regression

In [152]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver='lbfgs', max_iter=10000)
logreg.fit(X_train, y_train)
y_pred2 = logreg.predict(X_test)

In [153]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred2))

Accuracy: 0.9673202614379085


In [154]:
print_importances_table(logreg.coef_, names)

        Coef  Name
------------  ----------------------
 0.018489     physical_contact__5
 0.0184068    social_belonging__4
 0.0128968    Moral_ID__7
 0.0127846    Conspiracy_theories__4
 0.0119944    health_cond
 0.0119354    physical_hygiene__2
 0.0104806    Narcissism__2
 0.00971893   Narcissism__4
 0.00963583   morality_as_cooperat_5
 0.00888014   trait_optimism_1
 0.00859632   Narcissism__3
 0.00817583   physical_hygiene__5
 0.00807428   political_ideology
 0.00778134   policy_support__1
 0.00734784   social_belonging__1
 0.00734322   SUM_GEN
 0.00723017   collective_narcis__3
 0.00694238   trait_self.control__3
 0.00683051   Moral_ID__1
 0.00652065   physical_contact__2
 0.00543244   national_identity__1
 0.00534509   policy_support__3
 0.00505852   Moral_ID__5
 0.00435093   physical_hygiene__3
 0.00434038   Moral_ID__8
 0.00430237   open_mindedness__3
 0.00374671   morality_as_cooperat_7
 0.00351578   morality_as_cooperat_2
 0.00333359   Moral_ID__9
 0.00262206   generosity__3

Note: this output is identical to the SVM table because it was produced while `print_importances_table` read the global `clf.coef_` rather than its `coef` argument. The values shown are therefore the SVM coefficients, not those of the logistic regression.
 