# 作业1

**The future of employment** 
How susceptible are jobs to computerisation?

We examine how susceptible jobs are to computerisation. To assess this, we begin by implementing a novel methodology to estimate the probability of computerisation for 702 detailed occupations, using a Gaussian process classifier. Based on these estimates, we examine expected impacts of future computerisation on **US labour market** outcomes, with the primary objective of analysing the number of jobs at risk and the relationship between an occupations probability of computerisation, wages and educational attainment.

- C. Frey, M. Osborne The future of employment: How susceptible are jobs to computerisation? Technological Forecasting & Social Change 114 (2017) 254–280

First, together with a group of ML researchers, we subjectively hand-labelled 70 occupations, assigning 1 if automatable, and 0 if not.

Second, we use objective **O*NET** variables corresponding to the defined bottlenecks to computerisation. We are interested in variables describing the level of perception and manipulation, creativity, and social intelligence required to perform it. We identified **nine variables** of O*NET that describe these attributes.

In [1]:
import pandas as pd
import numpy as np
import pylab as plt
import seaborn as sns


In [2]:
# https://github.com/SocratesAcademy/ccbook/tree/master/data/jobdata.csv
df = pd.read_csv('jobdata.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,soc,Element Name,id,label,Data Value,computerization
0,0,11-1011,Assisting and Caring for Others,70,0,2.205,0.015
1,1,11-1011,"Cramped Work Space, Awkward Positions",70,0,1.415,0.015
2,2,11-1011,Fine Arts,70,0,0.915,0.015
3,3,11-1011,Finger Dexterity,70,0,2.0,0.015
4,4,11-1011,Manual Dexterity,70,0,0.0,0.015


In [3]:
len(df)

585

In [4]:
data_list=list(df['Data Value'])
X=[]
for i in range(0,585,9):
    list1=data_list[i:i+9]
    X.append(list1)
X=np.array(X)

len(X)

65

In [6]:
X[0]

array([2.205, 1.415, 0.915, 2.   , 0.   , 3.935, 4.125, 4.44 , 3.935])

In [7]:
data_list1=list(df['label'])
Y=[]
for i in range(0,585,9):
    list1=data_list1[i]
    Y.append(list1)
Y=np.array(Y)
Y=Y[:,np.newaxis]
Y[:3]

array([[0],
       [0],
       [0]])

In [8]:
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score, roc_auc_score, accuracy_score

X1, X2, y1, y2 = train_test_split(X, Y, random_state=0,
                                  train_size=0.6, test_size = 0.4)

In [9]:
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
bayes.fit(X1, y1.flatten())
y2_model = bayes.predict(X2)  
accuracy_score(y2, y2_model), roc_auc_score(y2, y2_model)

(0.8846153846153846, 0.8846153846153847)

**任务**: 使用RandomForestClassifier训练并计算accuracy_score和roc_auc_score 


**任务**: 使用SVC算法训练并计算accuracy_score和roc_auc_score

Accuracy score =  0.8846153846153846
ROC_AUC score = 0.8846153846153846


In [57]:
from sklearn.model_selection import cross_val_score

def cross_validation(model):
    roc_auc= cross_val_score(model, X, Y.flatten(), scoring="roc_auc", cv = 5)
    return roc_auc

svc_linear = SVC(kernel='linear', C=1E10)
np.mean(cross_validation(svc_linear))

0.9142857142857143

**任务**： 换用rbf的kernel来做交叉验证


0.9142857142857143

###  GPy 
The Gaussian processes framework in Python. https://github.com/SheffieldML/GPy

In [2]:
!pip install --upgrade GPy


Collecting GPy
  Using cached GPy-1.9.9-cp37-cp37m-macosx_10_9_x86_64.whl (1.5 MB)
Processing /Users/datalab/Library/Caches/pip/wheels/c8/4a/0e/6e0dc85541825f991c431619e25b870d4b812c911214690cf8/paramz-0.9.5-cp37-none-any.whl
Installing collected packages: paramz, GPy
Successfully installed GPy-1.9.9 paramz-0.9.5


In [27]:
import GPy

kernel = GPy.kern.RBF(input_dim=9, variance=1., lengthscale=1.)
m = GPy.models.GPRegression(X,Y,kernel)
m.optimize(messages=False)

print(m)



Name : GP regression
Objective : 28.130643010540453
Number of Parameters : 3
Number of Optimization Parameters : 3
Updates : True
Parameters:
  [1mGP_regression.         [0;0m  |               value  |  constraints  |  priors
  [1mrbf.variance           [0;0m  |  0.3113242734729479  |      +ve      |        
  [1mrbf.lengthscale        [0;0m  |   3.933616340596464  |      +ve      |        
  [1mGaussian_noise.variance[0;0m  |  0.0964434513555219  |      +ve      |        


In [26]:
X1, X2, y1, y2 = train_test_split(X, Y, random_state=0,
                                  train_size=0.6, test_size = 0.4)
m = GPy.models.GPRegression(X1,y1,kernel)#, normalizer = True)
m.optimize(messages=False)
y2_model = m.predict(X2)[0]
y2_hat = [1 if i > 0.5 else 0  for i in y2_model ]
print('Accuracy score = ', accuracy_score(y2, y2_hat))
print('ROC_AUC score =', roc_auc_score(y2, y2_hat))

Accuracy score =  0.9230769230769231
ROC_AUC score = 0.9230769230769231


# 作业2


**Fake News Detection**
Develop a machine learning program to identify when an article might be fake news.

- train.csv: A full training dataset with the following attributes:
    - id: unique id for a news article
    - title: the title of a news article
    - author: author of the news article
    - text: the text of the article; could be incomplete
    - label: a label that marks the article as potentially unreliable
        - 1: unreliable
        - 0: reliable
- test.csv: A testing training dataset with all the same attributes at train.csv without the label.

- submit.csv: A sample submission that you can

In [29]:
df = pd.read_csv('./Day7_kaggle_fakenews/train.csv')
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [31]:
df.iloc[0]['title']

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

In [32]:
df.shape

(20800, 5)

In [33]:
for k, i in enumerate(df['title'][:5]):
    print(k,'  -- > ',  i)

0   -- >  House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It
1   -- >  FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart
2   -- >  Why the Truth Might Get You Fired
3   -- >  15 Civilians Killed In Single US Airstrike Have Been Identified
4   -- >  Iranian woman jailed for fictional unpublished story about woman stoned to death for adultery


In [34]:
df['label'][:5]

0    1
1    0
2    1
3    1
4    1
Name: label, dtype: int64

In [37]:
df['text'][0][:300]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It By Darrell Lucus on October 30, 2016 Subscribe Jason Chaffetz on the stump in American Fork, Utah ( image courtesy Michael Jolley, available under a Creative Commons-BY license) \nWith apologies to Keith Olbermann, ther'

In [38]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

df=df.fillna(' ')
df['total']=df['title']+' '+df['author']+df['text']
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1,1), 
                                   stop_words = 'english',
                                  max_features = 10000)
counts = count_vectorizer.fit_transform(df['total'].values)
tfidf = transformer.fit_transform(counts)

In [39]:
diction  = count_vectorizer.get_feature_names()

In [40]:
tfidf = [i.toarray()[0] for i in tfidf]
tfidf[:3]

[array([0., 0., 0., ..., 0., 0., 0.]),
 array([0., 0., 0., ..., 0., 0., 0.]),
 array([0., 0., 0., ..., 0., 0., 0.])]

In [41]:
y = df['label']

In [42]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(tfidf, y, 
                                                random_state=1, 
                                                train_size = 0.8)

In [45]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB() 
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)

from sklearn.metrics import accuracy_score, roc_auc_score,  roc_curve, auc
accuracy_score(ytest, y_model)

0.9019230769230769

In [46]:
roc_auc_score(ytest, y_model)

0.9027959809796708

**任务**: 换另外一种算法训练并计算accuracy_score和roc_auc_score

# 作业3 （任选，可不选）

### Predicting poverty and wealth from mobile phone metadata

All other data and code, including all intermediate data needed to replicate these results and apply these methods in other contexts, are available through the Inter-university Consortium for Political and Social Research (http://doi.org/10.3886/E50592V2). https://www.openicpsr.org/openicpsr/project/100144/version/V5/view

数据和代码说明见 https://github.com/SocratesAcademy/css/issues/11