# Rate of Change Analysis

*Goal: Perform classification on rate of change (ROC) classes and evaluate hypothesized insights.*

Insights:

1. Which fields (at end year) are **most indicative** of ROC classes?

2. Which mother tongues between which *adjacent census pairs* **changed** home languages the most?

3. What **events** might influence these changes and why?

4. Can *other events* be identified which might **relate** to these changes?

5. How does random forest **compare** versus decision trees when performing this sort of analysis?

## Roadmap

1. **Research** fields between each census years to determine "equivalency classes" of languages

2. **Calculate** change classes for each mother tongue (mother tongue and possibly home language should be dropped from the dataset)

    * Old mother tongue keep rate = Weighted (HLNP == MTNP) / Weighted All_MTNP
    * New mother tongue keep rate = Weighted (HLANO == MTNNO) / Weighted All_MTNNO
    * Difference between new and old keep rate should be compared to caluclate a *change class*

3. **Preprocess** data, dropping irrelevant fields and discretizing as needed

4. Perform **random forest** and/or decision tree algorithms

5. Perform **event-based** analysis

## Data Preprocessing

### Manual Feature Selection

For ease of loading, relevant features should be hand-selected and saved.

Broader analysis may be performed at a later date to identify all relevant features.

In [1]:
import pandas as pd

In [2]:
fields = ['PPSORT', 'WEIGHT', 'ABOID', 'BFNMEMB', 'CFSTAT', 'Citizen', 'CMA',
       'DETH123', 'HHTYPE',
       'MarStH', 'PR', 'PRIHM', 'REGIND', 'Sex', 'SHELCO',
       'AGEGRP', 'ATTSCH', 'BedRm', 'CFInc', 'CFInc_AT', 'CfSize',
       'CIP2011', 'CIP2011_STEM_SUM', 'CONDO', 'DPGRSUM', 'DTYPE',
       'EFDecile', 'EfDIMBM', 'EFInc', 'EFInc_AT', 'EfSize', 'ETHDER',
       'GENSTAT', 'HCORENEED_IND', 'HDGREE', 'HHInc', 'HHInc_AT',
       'HHMRKINC', 'HHSIZE', 'HLANO', 'IMMCAT5', 'IMMSTAT',
       'LFACT', 'LICO', 'LICO_AT', 'LOC_ST_RES', 'LoLIMA', 'LoLIMB',
       'LoMBM', 'LSTWRK', 'MOB1', 'Mob5', 'MrkInc',
       'MTNNO', 'NOS', 'PKID0_1', 'PKID15_24', 'PKID2_5', 'PKID25',
       'PKID6_14', 'PKIDS', 'POB', 'POBF', 'POBM', 'PR1', 'PR5',
       'PresMortG', 'REPAIR', 'ROOMS', 'SSGRAD', 'Tenur', 'TotInc',
       'TotInc_AT', 'VALUE', 'VisMin', 'WRKACT']

In [3]:
dataset_2016 = pd.read_csv('pumf-98M0001-E-2016-individuals_F1.csv')
dataset_1991 = pd.read_csv('pumf-95M0007-E-1991-individuals_F1.csv')

fields_2016 = set(dataset_2016.columns)
fields_1991 = set(dataset_1991.columns)

print(f'Manually selected fields: {len(fields)}')
print(f'2016 fields: {len(fields_2016)}')
print(f'1991 fields: {len(fields_1991)}')

intersected_fields = set(fields).intersection(fields_2016)
intersected_fields = intersected_fields.intersection(fields_1991)

print(f'Intersected fields: {len(intersected_fields)}')

dataset_2016[intersected_fields].to_csv('pumf-2016-selected-features.csv', index=False)
dataset_1991[intersected_fields].to_csv('pumf-1991-selected-features.csv', index=False)

Manually selected fields: 76
2016 fields: 141
1991 fields: 119
Intersected fields: 0


In [6]:
for field in fields:
    print(field in dataset_2016)

True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True


### Loading

In [4]:
start_year = pd.read_csv()

Unnamed: 0,PPSORT,WEIGHT,ABOID,BFNMEMB,CFSTAT,Citizen,CMA,DETH123,HHTYPE,MarStH,...,PresMortG,REPAIR,ROOMS,SSGRAD,Tenur,TotInc,TotInc_AT,VALUE,VisMin,WRKACT
0,453141,37.037277,6,0,2,1,505,1,2,2,...,1,1,11,5,1,97000,73000,450000,13,11
1,923226,37.037277,6,0,4,1,505,2,2,1,...,1,1,11,99,1,99999999,99999999,440000,13,99
2,385097,37.037277,6,0,4,1,505,2,2,1,...,1,1,11,99,1,99999999,99999999,440000,13,99
3,732612,37.037277,6,0,2,1,999,2,2,3,...,1,2,8,8,1,46000,41000,839779,13,11
4,143665,37.120914,6,0,1,1,999,1,1,3,...,1,3,5,6,1,30000,26000,60000,13,9
5,459269,37.037277,6,0,1,1,999,2,1,2,...,0,2,7,4,1,82000,69000,839779,13,1
6,632419,37.037277,6,0,1,1,999,2,1,2,...,0,2,7,6,1,51000,45000,839779,13,1
7,284347,37.019784,6,0,6,1,825,2,8,6,...,0,1,5,6,1,41000,37000,310000,13,1
8,52611,37.041550,6,0,1,1,421,1,1,2,...,9,1,4,5,2,53000,43000,99999999,13,11
9,36927,37.037277,6,0,4,1,532,1,2,1,...,1,1,10,4,1,27000,26000,640000,13,7


In [43]:
motherTongueDict = {}
motherTongueDict['1'] = 'No non-official languages'
motherTongueDict['2'] = 'Aboriginal languages'
motherTongueDict['3'] = 'Arabic'
motherTongueDict['4'] = 'Mandarin'
motherTongueDict['5'] = 'Cantonese'
motherTongueDict['6'] = 'Chinese languages'
motherTongueDict['7'] = 'German'
motherTongueDict['8'] = 'All other languages' #'Other Germanic languages'
motherTongueDict['9'] = 'All other languages' #'Greek'
motherTongueDict['10'] = 'Urdu'
motherTongueDict['11'] = 'Persian (Farsi)'
motherTongueDict['12'] = 'Other Indo-Iranian languages'
motherTongueDict['13'] = 'Italian'
motherTongueDict['14'] = 'Polish'
motherTongueDict['15'] = 'Portuguese'
motherTongueDict['16'] = 'Punjabi (Panjabi)'
motherTongueDict['17'] = 'Spanish'
motherTongueDict['18'] = 'All other languages' #'Ukrainian'
motherTongueDict['19'] = 'All other languages' #'Vietnamese'
motherTongueDict['20'] = 'Austro-Asiatic languages'
motherTongueDict['21'] = 'Other European languages'
motherTongueDict['22'] = 'Russian'
motherTongueDict['23'] = 'All other languages' #'Other Slavic languages'
motherTongueDict['24'] = 'All other languages' #'Uralic languages'
motherTongueDict['25'] = 'Other Afro-Asiatic and African languages' #'Other Afro-Asiatic languages'
motherTongueDict['26'] = 'Tamil'
motherTongueDict['27'] = 'All other languages' #'Other Dravidian languages'
motherTongueDict['28'] = 'All other languages' #'Korean'
motherTongueDict['29'] = 'Other East and Southeast Asian languages'
motherTongueDict['30'] = 'Tagalog (Pilipino, Filipino)'
motherTongueDict['31'] = 'All other languages' #'Niger-Congo languages and other African languages'
motherTongueDict['32'] = 'All other languages' #'All other single languages'
motherTongueDict['88'] = 'Not available'

In [44]:
homeLanguageDict = {}
homeLanguageDict['1'] = 'No non-official languages'
homeLanguageDict['2'] = 'Aboriginal languages'
homeLanguageDict['3'] = 'Italian'
homeLanguageDict['4'] = 'Spanish'
homeLanguageDict['5'] = 'Portugese'
homeLanguageDict['6'] = 'German'
homeLanguageDict['7'] = 'Russian'
homeLanguageDict['8'] = 'Polish'
homeLanguageDict['9'] = 'Slavic languages'
homeLanguageDict['10'] = 'Other European languages'
homeLanguageDict['11'] = 'Arabic'
homeLanguageDict['12'] = 'Other Afro-Asiatic and African languages'
homeLanguageDict['13'] = 'Punjabi (Panjabi)'
homeLanguageDict['14'] = 'Urdu'
homeLanguageDict['15'] = 'Persian (Farsi)'
homeLanguageDict['16'] = 'Other Indo-Iranian languages'
homeLanguageDict['17'] = 'Cantonese'
homeLanguageDict['18'] = 'Mandarin'
homeLanguageDict['19'] = 'Chinese languages'
homeLanguageDict['20'] = 'Austro-Asiatic languages'
homeLanguageDict['21'] = 'Tagalog (Pilipino, Filipino)'
homeLanguageDict['22'] = 'Other East and Southeast Asian languages'
homeLanguageDict['23'] = 'Tamil'
homeLanguageDict['24'] = 'All other languages'
homeLanguageDict['88'] = 'Not available'

In [45]:
# dataset = dataset[dataset.HLANO != 88]
# dataset = dataset[dataset.MTNNO != 88]
# class label: home language part A - first language write in component
homeLang = dataset['HLANO']
# class label: mother tongue part A - first language write in component
motherTongue = dataset['MTNNO']
fields.remove('HLANO')
fields.remove('MTNNO')
homeLanguageMapped = homeLang.apply(lambda x: homeLanguageDict[str(x)])
motherTongueMapped = motherTongue.apply(lambda x: motherTongueDict[str(x)])
languageShift = homeLanguageMapped.ne(motherTongueMapped)

fields.remove('WEIGHT')
x = dataset[fields]
weights = dataset["WEIGHT"]

In [46]:
homeLanguageMapped

0         No non-official languages
1         No non-official languages
2         No non-official languages
3         No non-official languages
4         No non-official languages
                    ...            
930416    No non-official languages
930417    No non-official languages
930418    No non-official languages
930419                    Cantonese
930420    No non-official languages
Name: HLANO, Length: 930421, dtype: object

In [47]:
motherTongueMapped

0         No non-official languages
1         No non-official languages
2         No non-official languages
3         No non-official languages
4         No non-official languages
                    ...            
930416    No non-official languages
930417    No non-official languages
930418    No non-official languages
930419                    Cantonese
930420    No non-official languages
Name: MTNNO, Length: 930421, dtype: object

In [48]:
languageShift

0         False
1         False
2         False
3         False
4         False
          ...  
930416    False
930417    False
930418    False
930419    False
930420    False
Length: 930421, dtype: bool

In [49]:
x

Unnamed: 0,PPSORT,ABOID,BFNMEMB,CFSTAT,Citizen,CMA,DETH123,HHTYPE,MarStH,PR,...,PresMortG,REPAIR,ROOMS,SSGRAD,Tenur,TotInc,TotInc_AT,VALUE,VisMin,WRKACT
0,453141,6,0,2,1,505,1,2,2,35,...,1,1,11,5,1,97000,73000,450000,13,11
1,923226,6,0,4,1,505,2,2,1,35,...,1,1,11,4,1,49071,39759,440000,13,11
2,385097,6,0,4,1,505,2,2,1,35,...,1,1,11,4,1,49071,39759,440000,13,11
3,732612,6,0,2,1,999,2,2,3,35,...,1,2,8,8,1,46000,41000,839779,13,11
4,143665,6,0,1,1,999,1,1,3,11,...,1,3,5,6,1,30000,26000,60000,13,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
930416,700854,6,0,1,1,535,2,1,2,35,...,0,1,11,8,1,120000,94000,810000,13,11
930417,821443,6,0,2,1,541,1,7,2,35,...,8,1,5,2,1,77000,61000,494190,13,11
930418,116531,6,0,7,1,988,1,9,1,59,...,1,2,5,6,1,80000,62000,710000,13,9
930419,499993,6,0,3,2,535,1,5,6,35,...,1,2,7,1,2,22000,23000,494190,2,1


In [50]:
weights

0         37.037277
1         37.037277
2         37.037277
3         37.037277
4         37.120914
            ...    
930416    37.037277
930417    37.037277
930418    37.042280
930419    37.037277
930420    37.037277
Name: WEIGHT, Length: 930421, dtype: float64

In [51]:
from sklearn.ensemble import RandomForestClassifier

In [52]:
classifier = RandomForestClassifier(n_estimators=20, random_state=0)

In [53]:
from sklearn.model_selection import train_test_split
weights_train, weights_test, x_train, x_test, languageShift_train, languageShift_test = train_test_split(weights, x, languageShift)

In [54]:
classifier.fit(x_train, languageShift_train, weights_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [55]:
languageShift_pred = classifier.predict(x_test)

In [56]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

The Confusion Matrix indicates, given the predicted language (row names), for how many records did the algorith match the actual language. The classification report and accuracy score are other ways to evaluate how well the classification algorithm performed.

In [57]:
confusionMatrixLanguageShift = pd.DataFrame(
    confusion_matrix(languageShift_test,languageShift_pred), 
    index=['False', 'True'], 
    columns=['False', 'True'], 
)

In [58]:
print(confusionMatrixLanguageShift)
print(classification_report(languageShift_test,languageShift_pred,target_names=['False', 'True']))
print(accuracy_score(languageShift_test, languageShift_pred))

        False   True
False  199910   4631
True    17776  10289
              precision    recall  f1-score   support

       False       0.92      0.98      0.95    204541
        True       0.69      0.37      0.48     28065

    accuracy                           0.90    232606
   macro avg       0.80      0.67      0.71    232606
weighted avg       0.89      0.90      0.89    232606

0.9036697247706422


In [59]:
!pip install tabulate



In [60]:
from tabulate import tabulate
importances = classifier.feature_importances_
headers = ["name", "score"]
values = sorted(zip(x_train.columns, classifier.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt="plain"))

name                    score
POB               0.0594484
ETHDER            0.0534158
POBM              0.0450321
POBF              0.0403949
IMMSTAT           0.0374875
PPSORT            0.0367282
GENSTAT           0.0333007
SHELCO            0.0278307
VALUE             0.0265353
IMMCAT5           0.0254839
AGEGRP            0.0229826
TotInc            0.0228719
TotInc_AT         0.0227842
VisMin            0.0218239
MrkInc            0.0216965
HHMRKINC          0.0203217
ROOMS             0.0199808
DPGRSUM           0.018883
EfDIMBM           0.0184105
CFInc             0.0173546
CMA               0.0168228
CFInc_AT          0.0168056
HHInc             0.0164851
HHInc_AT          0.0163105
Citizen           0.0161591
EFInc             0.0158145
EFInc_AT          0.0155211
HDGREE            0.0137588
BedRm             0.0122606
SSGRAD            0.0122191
WRKACT            0.0120619
CIP2011           0.0116925
CIP2011_STEM_SUM  0.0114265
EFDecile          0.0113859
PR1               0

In [61]:
from sklearn.tree import export_graphviz
estimator = classifier.estimators_[5]
# Creates dot file named randomForestTreeSample.dot
export_graphviz(estimator, 
                out_file='randomForestTreeSample.dot', 
                feature_names = list(x.columns),
                class_names = ['False', 'True'],
                rounded = True, proportion = False, 
                precision = 2, filled = True,
                max_depth=5)