# Mother Tongues of Canada, by Knowledge of the official languages
by Grace Cowderoy

## Abstract
Canada has two official languages: English and French. According to the 2021 Statistics Canada census, 98.2% of Canadian residents speak at least one of the official languages. However, only 74.5% of the population report one of the official languages as their first language (L1), also called their mother tongue: the language first learnt in childhood. For the other 25.5% of people, the mother tongue is a non-official language or more than one mother tongue was reported.

The dataset, Table 98-10-0175-01, Mother tongue by knowledge of official languages: Canada, provinces and territories, census metropolitan areas and census agglomerations with parts, provides data from the 2021 Statistics Canada census on each respondent’s mother tongue, age at the time of the census, gender, knowledge of the two official languages, and location within Canada.

This study proposes to weigh the factors that influence the knowledge of each official language. Research questions include: Does living in a particular area influence a speaker of a non-official language towards English or French? Does any single mother tongue influence a speaker towards English or French? What influence does the age of the speaker have on knowledge of the official languages? 

Classification techniques will be used to identify groups that are more likely to speak only one of the official languages, both official languages, or neither official language. As the dataset is imbalanced, with more speakers of English than French, balancing techniques such as oversampling and undersampling will be used.


## Introduction
In Canada, the term ‘bilingual’ refers to speakers of both official languages, as opposed to the more general term where a speaker knows two different languages. The distribution of these bilingual speakers is not uniform across Canada, and neither is the distribution of speakers of non-official languages. 

The population of Canada has a variety of both indigenous and non-indigenous mother tongues. Canada is home to 81 living indigenous languages (Ethnologue), and many immigrants have brought their own native languages to Canada.

There are approximately 7,000 languages currently in use in the world (Ethnologue). These languages can be grouped into families. English belongs to the Germanic branch of the Indo-European family, while French belongs to the Romance branch of the same family. It is generally easier to acquire a second language (L2) when it is closely related to the learner’s first language (L1) (Gampe, A 2021). As such, within Canada, it may reasonably be expected that speakers with an L1 more closely related to French would have greater knowledge of French.

Another factor may be the location of the individual within Canada, as policies in different provinces may promote one official language over the other. The age of the individual can also affect their language skills, as it is generally easier to acquire a second language in childhood than in adulthood. Further, language policy in Canada has changed over time and varies between the provinces and territories, so the effect of age on knowledge of the official languages may depend on the policies in place during a speaker’s lifetime.

This study investigates the Statistics Canada dataset Table 98-10-0175-01, Mother tongue by knowledge of official languages: Canada, provinces and territories, census metropolitan areas and census agglomerations with parts, https://doi.org/10.25318/9810017501-eng

The dataset was released on 2022-08-17 and comes from the 2021 Census of Population of Canada. Because the dataset was released recently, it does not yet appear to have been cited on Web of Science (as of 2023-10-07).

The dataset has multiple dimensions: age, gender, geographic location, the respondent's mother tongue, and the respondent's knowledge of the official languages (French and English). Each dimension includes aggregates as individual records; for example, Canada, Ontario, and Toronto each appear as separate records. Part of the data cleaning will involve converting these aggregations into their tree structures.

The first cells import the required libraries, load the dataset (Table 98-10-0175-01) as motherTongues_data, and carry out the initial exploratory data analysis.

In [1]:
#Data Handling and Modeling Tools
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from ydata_profiling import ProfileReport
from ydata_profiling.model.typeset import ProfilingTypeSet
from bigtree import dataframe_to_tree_by_relation
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_validate
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree, export_graphviz
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, brier_score_loss, matthews_corrcoef
from skmultilearn.model_selection import iterative_train_test_split
from skmultilearn.model_selection.measures import get_combination_wise_output_matrix
import graphviz


In [2]:
#Benchmarking Tools
import pyperformance
import pyperf
import memory_profiler
from scipy.stats import ttest_ind

In [3]:
%load_ext memory_profiler

In [4]:
motherTongues_filepath = '/home/grace/Documents/CapstoneProject/MotherTongues/98100175.csv'
motherTongues_data = pd.read_csv(motherTongues_filepath)
motherTongues_data.describe()

Unnamed: 0,REF_DATE,Knowledge of official languages (5):Total - Knowledge of official languages[1],Symbol,Knowledge of official languages (5):English only[2],Symbol.1,Knowledge of official languages (5):French only[3],Symbol.2,Knowledge of official languages (5):English and French[4],Symbol.3,Knowledge of official languages (5):Neither English nor French[5],Symbol.4
count,2591730.0,2591730.0,0.0,2591730.0,0.0,2591730.0,0.0,2591730.0,0.0,2591730.0,0.0
mean,2021.0,1067.793,,749.971,,101.061,,188.1355,,28.55237,
std,0.0,65127.09,,47375.92,,10027.69,,13140.44,,1603.253,
min,2021.0,0.0,,0.0,,0.0,,0.0,,0.0,
25%,2021.0,0.0,,0.0,,0.0,,0.0,,0.0,
50%,2021.0,0.0,,0.0,,0.0,,0.0,,0.0,
75%,2021.0,0.0,,0.0,,0.0,,0.0,,0.0,
max,2021.0,36620960.0,,25261660.0,,4087895.0,,6581680.0,,689725.0,


In [5]:
print(len(motherTongues_data))
list(motherTongues_data.columns)

2591730


['REF_DATE',
 'GEO',
 'DGUID',
 'Gender (3)',
 'Age (15A)',
 'Mother tongue (331)',
 'Coordinate',
 'Knowledge of official languages (5):Total - Knowledge of official languages[1]',
 'Symbol',
 'Knowledge of official languages (5):English only[2]',
 'Symbol.1',
 'Knowledge of official languages (5):French only[3]',
 'Symbol.2',
 'Knowledge of official languages (5):English and French[4]',
 'Symbol.3',
 'Knowledge of official languages (5):Neither English nor French[5]',
 'Symbol.4']

In [6]:
motherTongues_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2591730 entries, 0 to 2591729
Data columns (total 17 columns):
 #   Column                                                                          Dtype  
---  ------                                                                          -----  
 0   REF_DATE                                                                        int64  
 1   GEO                                                                             object 
 2   DGUID                                                                           object 
 3   Gender (3)                                                                      object 
 4   Age (15A)                                                                       object 
 5   Mother tongue (331)                                                             object 
 6   Coordinate                                                                      object 
 7   Knowledge of official languages (5):Total - K

In [7]:
genderList = motherTongues_data["Gender (3)"].unique()


In [8]:
profile = ProfileReport(motherTongues_data, title = "Profiling Report", explorative=True)
##profile

In [9]:
#profile.to_file('/home/grace/Documents/CapstoneProject/MotherTongues-Canada/EDA.html')


The YData Profiling report identified a number of empty features and some constant (uniform) features. These are handled in the next steps, after which the profile is run again.
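As a cross-check that does not depend on the profiling report, empty and constant columns can also be located directly with pandas. The following is a minimal sketch on a toy frame; the frame `toy` and the helper `empty_and_constant_columns` are illustrative names and are not part of the notebook.

```python
import pandas as pd

def empty_and_constant_columns(frame):
    """Return (all-null columns, single-valued columns) of a DataFrame."""
    # A column is "empty" when every value is missing.
    empty = [c for c in frame.columns if frame[c].isna().all()]
    # A column is "constant" when it holds exactly one distinct value.
    constant = [c for c in frame.columns
                if c not in empty and frame[c].nunique(dropna=False) == 1]
    return empty, constant

# Toy data mimicking the shape of the problem: one constant and one empty column.
toy = pd.DataFrame({
    'REF_DATE': [2021, 2021, 2021],
    'GEO': ['Canada', 'Ontario', 'Toronto'],
    'Symbol': [None, None, None],
})
empty_cols, constant_cols = empty_and_constant_columns(toy)
print(empty_cols, constant_cols)
```

On the real data, the same helper could be called on motherTongues_data to list the candidates for removal before dropping them.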

In [10]:
motherTongues_data.isnull().sum()

REF_DATE                                                                                0
GEO                                                                                     0
DGUID                                                                                   0
Gender (3)                                                                              0
Age (15A)                                                                               0
Mother tongue (331)                                                                     0
Coordinate                                                                              0
Knowledge of official languages (5):Total - Knowledge of official languages[1]          0
Symbol                                                                            2591730
Knowledge of official languages (5):English only[2]                                     0
Symbol.1                                                                          2591730
Knowledge 

In [11]:
df = motherTongues_data[[c for c in motherTongues_data.columns if 'Symbol' not in c]]
list(df.columns)

['REF_DATE',
 'GEO',
 'DGUID',
 'Gender (3)',
 'Age (15A)',
 'Mother tongue (331)',
 'Coordinate',
 'Knowledge of official languages (5):Total - Knowledge of official languages[1]',
 'Knowledge of official languages (5):English only[2]',
 'Knowledge of official languages (5):French only[3]',
 'Knowledge of official languages (5):English and French[4]',
 'Knowledge of official languages (5):Neither English nor French[5]']

In [12]:
df = df.drop(columns=['REF_DATE', 'DGUID'])

In [13]:
df.isnull().sum()

GEO                                                                               0
Gender (3)                                                                        0
Age (15A)                                                                         0
Mother tongue (331)                                                               0
Coordinate                                                                        0
Knowledge of official languages (5):Total - Knowledge of official languages[1]    0
Knowledge of official languages (5):English only[2]                               0
Knowledge of official languages (5):French only[3]                                0
Knowledge of official languages (5):English and French[4]                         0
Knowledge of official languages (5):Neither English nor French[5]                 0
dtype: int64

In [14]:
profileDF = ProfileReport(df, title = "Profiling Report (Reduced)", explorative=True)
#profileDF.to_file('/home/grace/Documents/CapstoneProject/MotherTongues-Canada/EDA-reduced.html')

In [15]:
list(df.columns)

['GEO',
 'Gender (3)',
 'Age (15A)',
 'Mother tongue (331)',
 'Coordinate',
 'Knowledge of official languages (5):Total - Knowledge of official languages[1]',
 'Knowledge of official languages (5):English only[2]',
 'Knowledge of official languages (5):French only[3]',
 'Knowledge of official languages (5):English and French[4]',
 'Knowledge of official languages (5):Neither English nor French[5]']

The dataset includes aggregations as individual rows. For example, the 'Total - Knowledge of official languages' value for Single responses (row 1, coordinate 1.1.1.2) is the sum of the totals for Official languages (row 2) and Non-official languages (row 5). Part of the data cleaning will require separating out these aggregations.
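The parent/child relationship can be checked with a little arithmetic on the Canada-level figures for those three rows. This is a small sketch using values copied by hand from the table; the frame name `rows` is illustrative. The totals add up exactly, while the 'English only' column differs by a few units, presumably because published census counts are rounded (an assumption, not something stated in the notebook).

```python
import pandas as pd

# Canada-level counts for coordinates 1.1.1.2, 1.1.1.3 and 1.1.1.6.
rows = pd.DataFrame(
    {'Total': [35145265, 27296445, 7848820],
     'English only': [24306165, 18325325, 5980845]},
    index=['Single responses', 'Official languages', 'Non-official languages'],
)

# Sum the two child rows and compare with the parent row.
children = rows.loc[['Official languages', 'Non-official languages']].sum()
difference = rows.loc['Single responses'] - children
print(difference.to_dict())
```

A difference of zero (or within rounding) supports treating 'Single responses' as an aggregate that should be dropped before modelling.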

The metadata shows that each geographical area has a unique member ID and is listed with its parent member ID.
E.g. Canada has ID 1; Nova Scotia has ID 10 and parent ID 1; and Halifax has ID 12 and parent ID 10 (Nova Scotia).

Similarly for age - Total Age has ID 1; 25 to 64 years has ID 9 and parent ID 1; 25 to 34 years has ID 10 and parent ID 9. 

Mother tongue, gender, and knowledge of official languages have similar encodings available in the metadata.

The next sections will review the tree hierarchies for the independent variables and flatten them to reduce dimensionality. 

In [16]:
df.head(n=8)

Unnamed: 0,GEO,Gender (3),Age (15A),Mother tongue (331),Coordinate,Knowledge of official languages (5):Total - Knowledge of official languages[1],Knowledge of official languages (5):English only[2],Knowledge of official languages (5):French only[3],Knowledge of official languages (5):English and French[4],Knowledge of official languages (5):Neither English nor French[5]
0,Canada,Total - Gender,Total - Age,Total - Mother tongue,1.1.1.1,36620955,25261655,4087895,6581680,689725
1,Canada,Total - Gender,Total - Age,Single responses,1.1.1.2,35145265,24306165,4029960,6130560,678580
2,Canada,Total - Gender,Total - Age,Official languages,1.1.1.3,27296445,18325325,3734010,5226490,10620
3,Canada,Total - Gender,Total - Age,English,1.1.1.4,20107200,18285580,5990,1806605,9025
4,Canada,Total - Gender,Total - Age,French,1.1.1.5,7189245,39740,3728020,3419880,1595
5,Canada,Total - Gender,Total - Age,Non-official languages,1.1.1.6,7848820,5980845,295950,904065,667955
6,Canada,Total - Gender,Total - Age,Indigenous languages,1.1.1.7,148895,123580,10995,8785,5535
7,Canada,Total - Gender,Total - Age,Algonquian languages,1.1.1.8,97125,79020,10730,5625,1760


The following section brings in the metadata files as individual dataframes. The metadata provides the ID of each element and the ID of its parent element. Any element whose ID does not appear in the Parent ID column is not an aggregation. Each element is therefore flagged as a parent or not: True when its ID appears in the Parent ID column and False when it does not. This allows us to keep only those values that are not aggregations.
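The parent flag can be illustrated on the three geography members mentioned earlier (Canada, Nova Scotia, Halifax). This is a toy sketch; the frame `meta` and the column name 'Is Parent' are illustrative only.

```python
import pandas as pd

# Three members with the IDs given in the metadata example.
meta = pd.DataFrame({
    'Member Name': ['Canada', 'Nova Scotia', 'Halifax (CMA), N.S.'],
    'Member ID': [1, 10, 12],
    'Parent Member ID': [None, 1, 10],
})

# A member is an aggregate if some other member names it as parent.
meta['Is Parent'] = meta['Member ID'].isin(meta['Parent Member ID'])

# The leaves are the members that are never used as a parent.
leaves = meta.loc[~meta['Is Parent'], 'Member Name'].tolist()
print(leaves)
```

Only Halifax survives the filter, since both Canada and Nova Scotia appear in the parent column.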

In [17]:
motherTonguesMeta_filepath = '/home/grace/Documents/CapstoneProject/MotherTongues-Canada/MetaData_Geography.csv'
geoData = pd.read_csv(motherTonguesMeta_filepath)

#geoData.head()
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
geoReduced = geoData[['Member Name','Member ID','Parent Member ID']].copy()

# A member is an aggregate (parent) if its ID appears as another member's parent ID
isParent = geoReduced['Member ID'].isin(geoReduced['Parent Member ID'])

geoReduced['Is Parent Loc'] = isParent

#geoReduced.tail()
# Keep only the locations that are not aggregates
censusMA = geoReduced.loc[~geoReduced['Is Parent Loc']]

censusMA.head()


Unnamed: 0,Member Name,Member ID,Parent Member ID,Is Parent Loc
2,"Corner Brook (CA), N.L.",3,2.0,False
3,"Gander (CA), N.L.",4,2.0,False
4,"Grand Falls-Windsor (CA), N.L.",5,2.0,False
5,"St. John's (CMA), N.L.",6,2.0,False
7,"Charlottetown (CA), P.E.I.",8,7.0,False


The following section shows the hierarchical structure of the GEO attribute, the geographical location within Canada. Note that CMA refers to Census Metropolitan Area, while CA refers to Census Agglomeration (https://www12.statcan.gc.ca/census-recensement/2021/ref/dict/az/Definition-eng.cfm?ID=geo009).

In [18]:
geoTree = dataframe_to_tree_by_relation(geoReduced, 'Member ID','Parent Member ID')
geoTree.show(attr_list = ['Member Name'])


1 [Member Name=Canada]
├── 2 [Member Name=Newfoundland and Labrador]
│   ├── 3 [Member Name=Corner Brook (CA), N.L.]
│   ├── 4 [Member Name=Gander (CA), N.L.]
│   ├── 5 [Member Name=Grand Falls-Windsor (CA), N.L.]
│   └── 6 [Member Name=St. John's (CMA), N.L.]
├── 7 [Member Name=Prince Edward Island]
│   ├── 8 [Member Name=Charlottetown (CA), P.E.I.]
│   └── 9 [Member Name=Summerside (CA), P.E.I.]
├── 10 [Member Name=Nova Scotia]
│   ├── 11 [Member Name=Cape Breton (CA), N.S.]
│   ├── 12 [Member Name=Halifax (CMA), N.S.]
│   ├── 13 [Member Name=Kentville (CA), N.S.]
│   ├── 14 [Member Name=New Glasgow (CA), N.S.]
│   └── 15 [Member Name=Truro (CA), N.S.]
├── 16 [Member Name=New Brunswick]
│   ├── 17 [Member Name=Bathurst (CA), N.B.]
│   ├── 18 [Member Name=Campbellton (CA), N.B./Que.]
│   │   ├── 19 [Member Name=Campbellton (New Brunswick part) (CA), N.B.]
│   │   └── 20 [Member Name=Campbellton (Quebec part) (CA), Que.]
│   ├── 21 [Member Name=Edmundston (CA), N.B.]
│   ├── 22 [Member

In [19]:
motherTonguesMeta_filepath = '/home/grace/Documents/CapstoneProject/MotherTongues-Canada/MetaData_MotherTongue.csv'
motherTongueData = pd.read_csv(motherTonguesMeta_filepath)

motherTongueData.head()
motherTongueData = motherTongueData[['Member Name','Member ID','Parent Member ID']]

isLanguage = motherTongueData['Member ID'].isin(motherTongueData['Parent Member ID'])

motherTongueData['Is Parent Lang'] = isLanguage
motherTongueData.tail(8)

Unnamed: 0,Member Name,Member ID,Parent Member ID,Is Parent Lang
323,Hungarian,324,321.0,False
324,"Other languages, n.i.e.",325,98.0,False
325,Multiple responses,326,1.0,True
326,English and French,327,326.0,False
327,English and non-official language(s),328,326.0,False
328,French and non-official language(s),329,326.0,False
329,"English, French and non-official language(s)",330,326.0,False
330,Multiple non-official languages,331,326.0,False


In [20]:
motherTongueTree = dataframe_to_tree_by_relation(motherTongueData,'Member ID','Parent Member ID')
motherTongueTree.show(attr_list = ['Member Name'])

1 [Member Name=Total - Mother tongue]
├── 2 [Member Name=Single responses]
│   ├── 3 [Member Name=Official languages]
│   │   ├── 4 [Member Name=English]
│   │   └── 5 [Member Name=French]
│   └── 6 [Member Name=Non-official languages]
│       ├── 7 [Member Name=Indigenous languages]
│       │   ├── 8 [Member Name=Algonquian languages]
│       │   │   ├── 9 [Member Name=Blackfoot]
│       │   │   ├── 10 [Member Name=Cree-Innu languages]
│       │   │   │   ├── 11 [Member Name=Atikamekw]
│       │   │   │   ├── 12 [Member Name=Cree languages]
│       │   │   │   │   ├── 13 [Member Name=Ililimowin (Moose Cree)]
│       │   │   │   │   ├── 14 [Member Name=Inu Ayimun (Southern East Cree)]
│       │   │   │   │   ├── 15 [Member Name=Iyiyiw-Ayimiwin (Northern East Cree)]
│       │   │   │   │   ├── 16 [Member Name=Nehinawewin (Swampy Cree)]
│       │   │   │   │   ├── 17 [Member Name=Nehiyawewin (Plains Cree)]
│       │   │   │   │   ├── 18 [Member Name=Nihithawiwin (Woods Cree)]
│       │  

In [21]:
motherTongueLangs = motherTongueData.loc[(motherTongueData['Is Parent Lang']== False) & (motherTongueData['Parent Member ID'] != 326)]
motherTongueLangs.head(10)

Unnamed: 0,Member Name,Member ID,Parent Member ID,Is Parent Lang
3,English,4,3.0,False
4,French,5,3.0,False
8,Blackfoot,9,8.0,False
10,Atikamekw,11,10.0,False
12,Ililimowin (Moose Cree),13,12.0,False
13,Inu Ayimun (Southern East Cree),14,12.0,False
14,Iyiyiw-Ayimiwin (Northern East Cree),15,12.0,False
15,Nehinawewin (Swampy Cree),16,12.0,False
16,Nehiyawewin (Plains Cree),17,12.0,False
17,Nihithawiwin (Woods Cree),18,12.0,False


In [22]:
motherTonguesMeta_filepath = '/home/grace/Documents/CapstoneProject/MotherTongues-Canada/MetaData_Age.csv'
motherTongueAge = pd.read_csv(motherTonguesMeta_filepath)
motherTongueAge = motherTongueAge[['Member Name','Member ID','Parent Member ID']]

isAgeGroup = motherTongueAge['Member ID'].isin(motherTongueAge['Parent Member ID'])
motherTongueAge['Is Parent Age']=isAgeGroup
#motherTongueAge.head()

In [23]:
motherTonguesMeta_filepath = '/home/grace/Documents/CapstoneProject/MotherTongues-Canada/MetaData_Gender.csv'
motherTongueGender = pd.read_csv(motherTonguesMeta_filepath)
motherTongueGender = motherTongueGender[['Member Name','Member ID','Parent Member ID']]

isGenderGroup = motherTongueGender['Member ID'].isin(motherTongueGender['Parent Member ID'])
motherTongueGender['Is Parent Gender']=isGenderGroup
motherTongueGender.head()

Unnamed: 0,Member Name,Member ID,Parent Member ID,Is Parent Gender
0,Total - Gender,1,,True
1,Men+,2,1.0,False
2,Women+,3,1.0,False


This next section takes the metadata dataframes from the previous section and joins them to the motherTongues data loaded earlier. This makes it possible either to filter out the aggregations or to keep only the aggregations.

In [24]:
df = df.join(motherTongueData.set_index('Member Name'), on='Mother tongue (331)', rsuffix='_lang')
df = df.join(geoReduced.set_index('Member Name'), on='GEO',rsuffix='_geo')
df = df.join(motherTongueAge.set_index('Member Name'), on='Age (15A)', rsuffix='_age')
df = df.join(motherTongueGender.set_index('Member Name'),on='Gender (3)', rsuffix='_gender')
list(df.columns)

['GEO',
 'Gender (3)',
 'Age (15A)',
 'Mother tongue (331)',
 'Coordinate',
 'Knowledge of official languages (5):Total - Knowledge of official languages[1]',
 'Knowledge of official languages (5):English only[2]',
 'Knowledge of official languages (5):French only[3]',
 'Knowledge of official languages (5):English and French[4]',
 'Knowledge of official languages (5):Neither English nor French[5]',
 'Member ID',
 'Parent Member ID',
 'Is Parent Lang',
 'Member ID_geo',
 'Parent Member ID_geo',
 'Is Parent Loc',
 'Member ID_age',
 'Parent Member ID_age',
 'Is Parent Age',
 'Member ID_gender',
 'Parent Member ID_gender',
 'Is Parent Gender']

Aggregate values (those flagged as a parent) are excluded. Rows with multiple mother tongues are also excluded for consistency.

In [25]:

dfFlat = df.loc[
    (df['Is Parent Lang'] == False)
    & (df['Parent Member ID'] != 326)  # 326 = 'Multiple responses'
    & (df['Is Parent Loc'] == False)
    & (df['Is Parent Age'] == False)
    & (df['Is Parent Gender'] == False)
]



In [26]:
dfFlat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 911856 entries, 35420 to 2591723
Data columns (total 22 columns):
 #   Column                                                                          Non-Null Count   Dtype  
---  ------                                                                          --------------   -----  
 0   GEO                                                                             911856 non-null  object 
 1   Gender (3)                                                                      911856 non-null  object 
 2   Age (15A)                                                                       911856 non-null  object 
 3   Mother tongue (331)                                                             911856 non-null  object 
 4   Coordinate                                                                      911856 non-null  object 
 5   Knowledge of official languages (5):Total - Knowledge of official languages[1]  911856 non-null  int64  
 6  

In [27]:
dfFlatReduced = dfFlat[['GEO', 'Gender (3)', 'Age (15A)', 'Mother tongue (331)', 'Coordinate',  'Knowledge of official languages (5):English only[2]', 'Knowledge of official languages (5):French only[3]', 'Knowledge of official languages (5):English and French[4]', 'Knowledge of official languages (5):Neither English nor French[5]']]
dfFlatReduced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 911856 entries, 35420 to 2591723
Data columns (total 9 columns):
 #   Column                                                             Non-Null Count   Dtype 
---  ------                                                             --------------   ----- 
 0   GEO                                                                911856 non-null  object
 1   Gender (3)                                                         911856 non-null  object
 2   Age (15A)                                                          911856 non-null  object
 3   Mother tongue (331)                                                911856 non-null  object
 4   Coordinate                                                         911856 non-null  object
 5   Knowledge of official languages (5):English only[2]                911856 non-null  int64 
 6   Knowledge of official languages (5):French only[3]                 911856 non-null  int64 
 7   Knowledge of off

In [28]:
# Work on an explicit copy so the dtype conversions do not trigger SettingWithCopyWarning
dfFlatReduced = dfFlatReduced.copy()
dfFlatReduced['Mother tongue (331)'] = dfFlatReduced['Mother tongue (331)'].astype('category')
dfFlatReduced['GEO'] = dfFlatReduced['GEO'].astype('category')


The YData Profile Report is now repeated with the cleaned and flattened data.

In [29]:
profileDF = ProfileReport(dfFlatReduced, title = "Profiling Report (Reduced)", explorative=True)
#profileDF.to_file('/home/grace/Documents/CapstoneProject/MotherTongues-Canada/EDA-Flattened-reduced.html')

Now that the data has been cleaned and flattened, we are ready to start applying machine learning models. The first step is to identify which features are the target variables. We also need to encode the categorical features as numbers; fortunately, this has already been done, as the Coordinate feature is a combination of the member IDs and can be expanded by splitting the string on its delimiter. We will also divide the data into training and test sets at this time.
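To make the Coordinate encoding concrete, here is a small sketch that splits two coordinate strings into their four member IDs. The cast to integers is an optional extra step assumed here for illustration; the frame names `coords` and `parts` are not from the notebook.

```python
import pandas as pd

# Two coordinates of the form geography.gender.age.mother-tongue.
coords = pd.DataFrame({'Coordinate': ['3.2.3.4', '3.2.3.11']})

# Split on the '.' delimiter and convert the pieces from strings to integers.
parts = coords['Coordinate'].str.split('.', expand=True).astype(int)
parts.columns = ['Geography', 'Gender', 'Age', 'Mother Tongue']
print(parts.loc[1].tolist())
```

Without the cast, the four columns remain strings (object dtype), which takes more memory than integer columns.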

In [30]:
col_namesX = list(dfFlatReduced[[c for c in dfFlatReduced.columns if 'Knowledge' not in c]].columns)
col_namesY = list(dfFlatReduced[[c for c in dfFlatReduced.columns if 'Knowledge' in c]].columns)
#dfFlatReduced['Total People'] = dfFlatReduced[col_namesY].sum(axis='columns')
#dfFlatReduced['Not Empty'] = np.where(dfFlatReduced['Total People'] >0, True, False)

dfFlatReduced.head()
#print(col_namesX, col_namesY)
X_Categories = dfFlatReduced[col_namesX]
Y_Categories = dfFlatReduced[col_namesY]


In [31]:
dfFlatReduced[col_namesY].sum(axis='columns')

35420      525
35421        0
35425        0
35427        0
35429        0
          ... 
2591718      0
2591720      0
2591721      0
2591722      0
2591723      0
Length: 911856, dtype: int64

In [32]:

# Copy first so that adding the split columns does not trigger SettingWithCopyWarning
X_Categories = X_Categories.copy()
# Coordinate has the form geography.gender.age.mother tongue, e.g. 3.2.3.4
X_Categories[['Geography','Gender','Age','Mother Tongue']]=X_Categories['Coordinate'].str.split('.', expand=True)
X_Categories.head()

Unnamed: 0,GEO,Gender (3),Age (15A),Mother tongue (331),Coordinate,Geography,Gender,Age,Mother Tongue
35420,"Corner Brook (CA), N.L.",Men+,0 to 4 years,English,3.2.3.4,3,2,3,4
35421,"Corner Brook (CA), N.L.",Men+,0 to 4 years,French,3.2.3.5,3,2,3,5
35425,"Corner Brook (CA), N.L.",Men+,0 to 4 years,Blackfoot,3.2.3.9,3,2,3,9
35427,"Corner Brook (CA), N.L.",Men+,0 to 4 years,Atikamekw,3.2.3.11,3,2,3,11
35429,"Corner Brook (CA), N.L.",Men+,0 to 4 years,Ililimowin (Moose Cree),3.2.3.13,3,2,3,13


In [33]:
X = X_Categories[['Geography','Gender','Age','Mother Tongue']]
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 911856 entries, 35420 to 2591723
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Geography      911856 non-null  object
 1   Gender         911856 non-null  object
 2   Age            911856 non-null  object
 3   Mother Tongue  911856 non-null  object
dtypes: object(4)
memory usage: 34.8+ MB


In [34]:
#mlb = MultiLabelBinarizer(classes=Y_Categories.columns)
#y = mlb.fit(Y_Categories)
#y

In [35]:
Y_Categories.head()

Unnamed: 0,Knowledge of official languages (5):English only[2],Knowledge of official languages (5):French only[3],Knowledge of official languages (5):English and French[4],Knowledge of official languages (5):Neither English nor French[5]
35420,525,0,0,0
35421,0,0,0,0
35425,0,0,0,0
35427,0,0,0,0
35429,0,0,0,0


In [36]:
#Y_Categories['Target Vector'] = Y_Categories.loc[:, Y_Categories.columns != 'Target Vector'].values.tolist()
#Y_Categories.head()
#Y = Y_Categories['Target Vector']
#ts = pd.Series(pd.arrays.SparseArray(Y_Categories['Target Vector']))
#ts.head()
#Y = MultiLabelBinarizer()
#Y.fit(Y_Categories['Target Vector'])

Originally, the test set was 30% of the data; however, the notebook crashed when the decision tree's predict function was run. The test_size was therefore reduced to 25%, which still leaves 227,964 records in the test set.
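The record counts for the two candidate split sizes can be worked out by hand. The sketch below assumes that the test-set size is the requested fraction of the rows, rounded up, which is my understanding of how train_test_split behaves; the names `n_rows` and `split_sizes` are illustrative.

```python
import math

n_rows = 911856  # rows in the flattened table, as reported by dfFlat.info()

# Map each candidate test fraction to (training rows, test rows).
split_sizes = {}
for test_size in (0.30, 0.25):
    n_test = math.ceil(test_size * n_rows)
    split_sizes[test_size] = (n_rows - n_test, n_test)

print(split_sizes[0.25])  # (683892, 227964)
```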

In [37]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y_Categories, test_size=0.25,random_state=42)
print(len(X_train),len(Y_train),len(X_test),len(Y_test))

683892 683892 227964 227964


In [38]:
Y_EngOnly_train = Y_train['Knowledge of official languages (5):English only[2]']
Y_EngOnly_test = Y_test['Knowledge of official languages (5):English only[2]']
Y_FrOnly_train = Y_train['Knowledge of official languages (5):French only[3]']
Y_FrOnly_test = Y_test['Knowledge of official languages (5):French only[3]']
Y_Bilang_train = Y_train['Knowledge of official languages (5):English and French[4]']
Y_Bilang_test = Y_test['Knowledge of official languages (5):English and French[4]']
Y_Neither_train = Y_train['Knowledge of official languages (5):Neither English nor French[5]']
Y_Neither_test = Y_test['Knowledge of official languages (5):Neither English nor French[5]']

In [39]:
X_names = list(X_train.columns)
X_train.head()

Unnamed: 0,Geography,Gender,Age,Mother Tongue
768626,52,2,13,45
594718,40,3,12,243
2052890,138,3,8,29
1665636,112,3,8,45
549509,37,3,11,50


In [40]:
%%timeit

mtTreeEngOnly = DecisionTreeClassifier(max_depth=8)
mtTreeEngOnly=mtTreeEngOnly.fit(X_train,Y_EngOnly_train)
#print("Successfully trained the decision tree...")

1.96 s ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [41]:
%%memit
mtTreeEngOnly = DecisionTreeClassifier(max_depth=8)
mtTreeEngOnly=mtTreeEngOnly.fit(X_train,Y_EngOnly_train)


peak memory: 2153.81 MiB, increment: 0.12 MiB


Model Training Time: 

2.2 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

peak memory: 2792.07 MiB, increment: 0.62 MiB


In [42]:
%%timeit

crossValEngOnly = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_EngOnly_train),X_train,Y_EngOnly_train, cv=5, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))




12.9 s ± 270 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [43]:
%%memit
crossValEngOnly = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_EngOnly_train),X_train,Y_EngOnly_train, cv=5, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))




peak memory: 3389.31 MiB, increment: 1246.51 MiB


Cross Validation Time: 

15.2 s ± 1.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

peak memory: 3606.84 MiB, increment: 814.77 MiB

In [44]:
#%%timeit

crossValEngOnly5 = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_EngOnly_train),X_train,Y_EngOnly_train, cv=5, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))
crossValFrOnly5 = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_FrOnly_train),X_train,Y_FrOnly_train, cv=5, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))
crossValBilang5 = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_Bilang_train),X_train,Y_Bilang_train, cv=5, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))
crossValNeither5 = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_Neither_train),X_train,Y_Neither_train, cv=5, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))




Cross-validation time on all four models: 
51.3 s ± 1.91 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [59]:
print("5 Fold Cross Validation")
print("Metrics for English Only \n\n %0.2f mean and %0.2f std Matthews Correlation Coefficient \n %0.2f mean and %0.2f std Mean Absolute Error \n %0.2f mean and %0.2f std Mean Squared Error" 
      % (crossValEngOnly5['test_matthews_corrcoef'].mean(), crossValEngOnly5['test_matthews_corrcoef'].std(), crossValEngOnly5['test_neg_mean_absolute_error'].mean(),crossValEngOnly5['test_neg_mean_absolute_error'].std(), crossValEngOnly5['test_neg_mean_squared_error'].mean(), crossValEngOnly5['test_neg_mean_squared_error'].std()))

print("\nMatthews Correlation ", crossValEngOnly5['test_matthews_corrcoef'])
print("Mean Absolute Error ", crossValEngOnly5['test_neg_mean_absolute_error'] )
print("Mean Squared Error " ,crossValEngOnly5['test_neg_mean_squared_error'])
print("Fit time ", crossValEngOnly5['fit_time'].mean())
print("Score time ", crossValEngOnly5['score_time'].mean())
print("\n\n")
print("Metrics for French Only \n\n %0.2f mean and %0.2f std Matthews Correlation Coefficient \n %0.2f mean and %0.2f std Mean Absolute Error \n %0.2f mean and %0.2f std Mean Squared Error" 
      % (crossValFrOnly5['test_matthews_corrcoef'].mean(),crossValFrOnly5['test_matthews_corrcoef'].std(), crossValFrOnly5['test_neg_mean_absolute_error'].mean(),crossValFrOnly5['test_neg_mean_absolute_error'].std(), crossValFrOnly5['test_neg_mean_squared_error'].mean(),crossValFrOnly5['test_neg_mean_squared_error'].std()))
print("\nMatthews Correlation ", crossValFrOnly5['test_matthews_corrcoef'])
print("Mean Absolute Error ", crossValFrOnly5['test_neg_mean_absolute_error'] )
print("Mean Squared Error" ,crossValFrOnly5['test_neg_mean_squared_error'])
print("Fit time ", crossValFrOnly5['fit_time'].mean())
print("Score time ", crossValFrOnly5['score_time'].mean())
print("\n\n")
print("Metrics for Both English and French \n\n %0.2f mean and %0.2f std Matthews Correlation Coefficient \n %0.2f mean and %0.2f std Mean Absolute Error \n %0.2f mean and %0.2f std Mean Squared Error" 
      % (crossValBilang5['test_matthews_corrcoef'].mean(),crossValBilang5['test_matthews_corrcoef'].std(), crossValBilang5['test_neg_mean_absolute_error'].mean(),crossValBilang5['test_neg_mean_absolute_error'].std(), crossValBilang5['test_neg_mean_squared_error'].mean(), crossValBilang5['test_neg_mean_squared_error'].std()))
print("\nMatthews Correlation ", crossValBilang5['test_matthews_corrcoef'])
print("Mean Absolute Error ", crossValBilang5['test_neg_mean_absolute_error'] )
print("Mean Squared Error" ,crossValBilang5['test_neg_mean_squared_error'])
print("Fit time ", crossValBilang5['fit_time'].mean())
print("Score time ", crossValBilang5['score_time'].mean())
print("\n\n")
print("Metrics for Neither English nor French \n\n %0.2f mean and %0.2f std Matthews Correlation Coefficient \n %0.2f mean and %0.2f std Mean Absolute Error \n %0.2f mean and %0.2f std Mean Squared Error" 
      % (crossValNeither5['test_matthews_corrcoef'].mean(),crossValNeither5['test_matthews_corrcoef'].std(), crossValNeither5['test_neg_mean_absolute_error'].mean(),crossValNeither5['test_neg_mean_absolute_error'].std(), crossValNeither5['test_neg_mean_squared_error'].mean(),crossValNeither5['test_neg_mean_squared_error'].std()))
print("\nMatthews Correlation ", crossValNeither5['test_matthews_corrcoef'])
print("Mean Absolute Error ", crossValNeither5['test_neg_mean_absolute_error'] )
print("Mean Squared Error" ,crossValNeither5['test_neg_mean_squared_error'])
print("Fit time ", crossValNeither5['fit_time'].mean())
print("Score time ", crossValNeither5['score_time'].mean())

5 Fold Cross Validation
Metrics for English Only 

 0.12 mean and 0.01 std Matthews Correlation Coefficient 
 -21.84 mean and 2.12 std Mean Absolute Error 
 -858305.25 mean and 288926.94 std Mean Squared Error

Matthews Correlation  [0.13137028 0.12267611 0.12899181 0.12793975 0.11096916]
Mean Absolute Error  [-25.87930165 -21.80583277 -20.42770036 -19.9342365  -21.15409642]
Mean Squared Error  [-1402241.51752096  -760660.13532779  -535209.06066765  -794159.34342511
  -799256.17149688]
Fit time  1.6151402473449707
Score time  0.5006423473358155



Metrics for French Only 

 0.23 mean and 0.01 std Matthews Correlation Coefficient 
 -2.94 mean and 0.74 std Mean Absolute Error 
 -97864.67 mean and 44123.86 std Mean Squared Error

Matthews Correlation  [0.23536456 0.22094063 0.22023311 0.22613083 0.22255916]
Mean Absolute Error  [-4.17216824 -2.15120742 -3.31957625 -2.30965506 -2.75175832]
Mean Squared Error [-183711.64378304  -70907.46496173  -94854.32324643  -64099.87497989
  -75750.0658

In [46]:

crossValEngOnly = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_EngOnly_train),X_train,Y_EngOnly_train, cv=10, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))
crossValFrOnly = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_FrOnly_train),X_train,Y_FrOnly_train, cv=10, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))
crossValBilang = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_Bilang_train),X_train,Y_Bilang_train, cv=10, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))
crossValNeither = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_Neither_train),X_train,Y_Neither_train, cv=10, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))




In [58]:
print("10 Fold Cross Validation")
print("Metrics for English Only \n\n %0.2f mean and %0.2f std Matthews Correlation Coefficient \n %0.2f mean and %0.2f std Mean Absolute Error \n %0.2f mean and %0.2f std Mean Squared Error" % (crossValEngOnly['test_matthews_corrcoef'].mean(), crossValEngOnly['test_matthews_corrcoef'].std(), crossValEngOnly['test_neg_mean_absolute_error'].mean(),crossValEngOnly['test_neg_mean_absolute_error'].std(), crossValEngOnly['test_neg_mean_squared_error'].mean(), crossValEngOnly['test_neg_mean_squared_error'].std()))
print("\nMatthews Correlation ", crossValEngOnly['test_matthews_corrcoef'])
print("Mean Absolute Error ", crossValEngOnly['test_neg_mean_absolute_error'] )
print("Mean Squared Error" ,crossValEngOnly['test_neg_mean_squared_error'])
print("Fit time ", crossValEngOnly['fit_time'].mean())
print("Score time ", crossValEngOnly['score_time'].mean())
print("\n\n")
print("Metrics for French Only \n\n %0.2f mean and %0.2f std Matthews Correlation Coefficient \n %0.2f mean and %0.2f std Mean Absolute Error \n %0.2f mean and %0.2f std Mean Squared Error" 
      % (crossValFrOnly['test_matthews_corrcoef'].mean(),crossValFrOnly['test_matthews_corrcoef'].std(), crossValFrOnly['test_neg_mean_absolute_error'].mean(),crossValFrOnly['test_neg_mean_absolute_error'].std(), crossValFrOnly['test_neg_mean_squared_error'].mean(),crossValFrOnly['test_neg_mean_squared_error'].std()))
print("\nMatthews Correlation ", crossValFrOnly['test_matthews_corrcoef'])
print("Mean Absolute Error ", crossValFrOnly['test_neg_mean_absolute_error'] )
print("Mean Squared Error" ,crossValFrOnly['test_neg_mean_squared_error'])
print("Fit time ", crossValFrOnly['fit_time'].mean())
print("Score time ", crossValFrOnly['score_time'].mean())
print("\n\n")
print("Metrics for Both English and French \n\n %0.2f mean and %0.2f std Matthews Correlation Coefficient \n %0.2f mean and %0.2f std Mean Absolute Error \n %0.2f mean and %0.2f std Mean Squared Error" 
      % (crossValBilang['test_matthews_corrcoef'].mean(),crossValBilang['test_matthews_corrcoef'].std(), crossValBilang['test_neg_mean_absolute_error'].mean(),crossValBilang['test_neg_mean_absolute_error'].std(), crossValBilang['test_neg_mean_squared_error'].mean(), crossValBilang['test_neg_mean_squared_error'].std()))
print("\nMatthews Correlation ", crossValBilang['test_matthews_corrcoef'])
print("Mean Absolute Error ", crossValBilang['test_neg_mean_absolute_error'] )
print("Mean Squared Error" ,crossValBilang['test_neg_mean_squared_error'])
print("Fit time ", crossValBilang['fit_time'].mean())
print("Score time ", crossValBilang['score_time'].mean())
print("\n\n")
print("Metrics for Neither English nor French \n\n %0.2f mean and %0.2f std Matthews Correlation Coefficient \n %0.2f mean and %0.2f std Mean Absolute Error \n %0.2f mean and %0.2f std Mean Squared Error" 
      % (crossValNeither['test_matthews_corrcoef'].mean(),crossValNeither['test_matthews_corrcoef'].std(), crossValNeither['test_neg_mean_absolute_error'].mean(),crossValNeither['test_neg_mean_absolute_error'].std(), crossValNeither['test_neg_mean_squared_error'].mean(),crossValNeither['test_neg_mean_squared_error'].std()))
print("\nMatthews Correlation ", crossValNeither['test_matthews_corrcoef'])
print("Mean Absolute Error ", crossValNeither['test_neg_mean_absolute_error'] )
print("Mean Squared Error" ,crossValNeither['test_neg_mean_squared_error'])
print("Fit time ", crossValNeither['fit_time'].mean())
print("Score time ", crossValNeither['score_time'].mean())

10 Fold Cross Validation
Metrics for English Only 

 0.13 mean and 0.01 std Matthews Correlation Coefficient 
 -21.68 mean and 3.78 std Mean Absolute Error 
 -858499.38 mean and 546261.81 std Mean Squared Error

Matthews Correlation  [0.12691609 0.12817128 0.12453711 0.13030507 0.12926106 0.10759217
 0.13324699 0.12638975 0.13491379 0.13113881]
Mean Absolute Error  [-29.68270215 -18.14790174 -23.00267587 -16.80401819 -23.42065244
 -21.82529354 -25.09789586 -17.78188013 -22.57497551 -18.48440539]
Mean Squared Error [-2109366.6171955   -333532.84215529  -682273.90223574  -323213.72808493
 -1181814.23949758  -694907.28918393 -1188193.38709442  -387940.36358186
 -1268609.42037462  -415142.02795771]
Fit time  1.8563953638076782
Score time  0.2573038816452026



Metrics for French Only 

 0.23 mean and 0.01 std Matthews Correlation Coefficient 
 -2.90 mean and 1.04 std Mean Absolute Error 
 -97869.21 mean and 76416.25 std Mean Squared Error

Matthews Correlation  [0.22917734 0.24482342 0.226

In [48]:
'''
class_values = dfFlatReduced[col_namesY].unique()
feature_names = list(X_train.columns())
dot_data = export_graphviz(mtTree, out_file=None, feature_names=)


dot_data = export_graphviz(mtTree, 
                           out_file=None, 
                           filled=True, 
                           special_characters= True, 
                           class_names=col_namesY,
                           feature_names=X_names)
graph = graphviz.Source(dot_data, format="png")
graph
'''

'\nclass_values = dfFlatReduced[col_namesY].unique()\nfeature_names = list(X_train.columns())\ndot_data = export_graphviz(mtTree, out_file=None, feature_names=)\n\n\ndot_data = export_graphviz(mtTree, \n                           out_file=None, \n                           filled=True, \n                           special_characters= True, \n                           class_names=col_namesY,\n                           feature_names=X_names)\ngraph = graphviz.Source(dot_data, format="png")\ngraph\n'

In [49]:
'''
decision_pred_EngOnly = mtTreeEngOnly.predict(X_test)
#decision_pred_EngOnly.shape
mean_absolute_error(Y_EngOnly_test, decision_pred_EngOnly)
'''

'\ndecision_pred_EngOnly = mtTreeEngOnly.predict(X_test)\n#decision_pred_EngOnly.shape\nmean_absolute_error(Y_EngOnly_test, decision_pred_EngOnly)\n'

In [50]:
Y_Categories[[ c for c in col_namesY]].max()

Knowledge of official languages (5):English only[2]                  224905
Knowledge of official languages (5):French only[3]                    97265
Knowledge of official languages (5):English and French[4]            133175
Knowledge of official languages (5):Neither English nor French[5]      9625
dtype: int64

In [56]:
ttestEngMatt = ttest_ind(crossValEngOnly5['test_matthews_corrcoef'], crossValEngOnly['test_matthews_corrcoef'])
ttestEngAbs = ttest_ind(crossValEngOnly5['test_neg_mean_absolute_error'], crossValEngOnly['test_neg_mean_absolute_error'])
ttestEngSquare = ttest_ind(crossValEngOnly5['test_neg_mean_squared_error'],crossValEngOnly['test_neg_mean_squared_error'])

print(ttestEngMatt, '\n',ttestEngAbs,'\n',ttestEngSquare, '\n')

Ttest_indResult(statistic=-0.6717608668627505, pvalue=0.5134973932761651) 
 Ttest_indResult(statistic=-0.08086934645688067, pvalue=0.9367776128697753) 
 Ttest_indResult(statistic=0.0006929270542584094, pvalue=0.9994576446804804) 



In [52]:
ttestFrMatt = ttest_ind(crossValFrOnly5['test_matthews_corrcoef'], crossValFrOnly['test_matthews_corrcoef'])
ttestFrAbs = ttest_ind(crossValFrOnly5['test_neg_mean_absolute_error'], crossValFrOnly['test_neg_mean_absolute_error'])
ttestFrSquare = ttest_ind(crossValFrOnly5['test_neg_mean_squared_error'],crossValFrOnly['test_neg_mean_squared_error'])

print(ttestFrMatt, '\n',ttestFrAbs,'\n',ttestFrSquare)

Ttest_indResult(statistic=-0.979966900004891, pvalue=0.3449885080957128) 
 Ttest_indResult(statistic=-0.07336141286180903, pvalue=0.9426352897863237) 
 Ttest_indResult(statistic=0.0001143484172197165, pvalue=0.9999104992694747)


In [53]:
ttestBiMatt = ttest_ind(crossValBilang5['test_matthews_corrcoef'], crossValBilang['test_matthews_corrcoef'])
ttestBiAbs = ttest_ind(crossValBilang5['test_neg_mean_absolute_error'], crossValBilang['test_neg_mean_absolute_error'])
ttestBiSquare = ttest_ind(crossValBilang5['test_neg_mean_squared_error'],crossValBilang['test_neg_mean_squared_error'])

print(ttestBiMatt, '\n',ttestBiAbs,'\n',ttestBiSquare)

Ttest_indResult(statistic=0.25823928764986764, pvalue=0.800265480179747) 
 Ttest_indResult(statistic=-0.080393438182895, pvalue=0.9371488057961541) 
 Ttest_indResult(statistic=-0.006790139701439177, pvalue=0.9946853883007286)


In [54]:
ttestNeiMatt = ttest_ind(crossValNeither5['test_matthews_corrcoef'], crossValNeither['test_matthews_corrcoef'])
ttestNeiAbs = ttest_ind(crossValNeither5['test_neg_mean_absolute_error'], crossValNeither['test_neg_mean_absolute_error'])
ttestNeiSquare = ttest_ind(crossValNeither5['test_neg_mean_squared_error'],crossValNeither['test_neg_mean_squared_error'])

print(ttestNeiMatt, '\n',ttestNeiAbs,'\n',ttestNeiSquare)

Ttest_indResult(statistic=-0.18007779743014046, pvalue=0.8598682402111691) 
 Ttest_indResult(statistic=-0.007539450287700514, pvalue=0.9940989176591524) 
 Ttest_indResult(statistic=-0.0006844287651044715, pvalue=0.9994642963063929)


As the p-value for every one of these tests is above 0.05, we fail to reject the null hypothesis that the 5-fold and 10-fold scores have the same mean. That is, there is no significant difference between the results of the 5-fold cross-validation and the 10-fold cross-validation. 
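
As a rough check on what the t-test is doing, the sketch below recomputes the test statistic for the English Only MCC scores by hand, using only the fold scores printed earlier. It assumes the equal-variance (pooled) form of the two-sample t-test, which is the default behaviour of `ttest_ind`; the helper name `pooled_t_statistic` is illustrative and not part of the notebook. Only the statistic is computed here, not the p-value.

```python
from math import sqrt
from statistics import mean, variance

# MCC fold scores for the English Only model, copied from the printed output above.
mcc_5_fold = [0.13137028, 0.12267611, 0.12899181, 0.12793975, 0.11096916]
mcc_10_fold = [
    0.12691609, 0.12817128, 0.12453711, 0.13030507, 0.12926106,
    0.10759217, 0.13324699, 0.12638975, 0.13491379, 0.13113881,
]


def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming both samples share one variance."""
    n_a, n_b = len(a), len(b)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


t_stat = pooled_t_statistic(mcc_5_fold, mcc_10_fold)
print(f"t statistic for English Only MCC: {t_stat:.4f}")
```

The value should agree closely with the `statistic` reported for `ttestEngMatt` above. A statistic this close to zero means the difference between the two means is small relative to the spread of the fold scores.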

In [70]:
EngOnly = DecisionTreeRegressor(max_depth=8).fit(X_train,Y_EngOnly_train)
FrOnly = DecisionTreeRegressor(max_depth=8).fit(X_train,Y_FrOnly_train)
Bilang = DecisionTreeRegressor(max_depth=8).fit(X_train,Y_Bilang_train)
Neither = DecisionTreeRegressor(max_depth=8).fit(X_train,Y_Neither_train)

In [71]:
prEngOnly = EngOnly.predict(X_test)
prFrOnly = FrOnly.predict(X_test)
prBilang = Bilang.predict(X_test)
prNeither = Neither.predict(X_test)

In [81]:
print("Mean Absolute Error for: \nEnglish Only %0.2f \nFrench Only %0.2f \nBilingual %0.2f \nNeither %0.2f" %(mean_absolute_error(Y_EngOnly_test, prEngOnly),mean_absolute_error(Y_FrOnly_test, prFrOnly),mean_absolute_error(Y_Bilang_test, prBilang), mean_absolute_error(Y_Neither_test, prNeither)))
print('\n\n')
print("Mean Squared Error for: \nEnglish Only %0.2f \nFrench Only %0.2f \nBilingual %0.2f \nNeither %0.2f" %(mean_squared_error(Y_EngOnly_test, prEngOnly),mean_squared_error(Y_FrOnly_test, prFrOnly),mean_squared_error(Y_Bilang_test, prBilang), mean_squared_error(Y_Neither_test, prNeither)))

Mean Absolute Error for: 
English Only 15.31 
French Only 1.59 
Bilingual 3.24 
Neither 0.84



Mean Squared Error for: 
English Only 67883.19 
French Only 10583.69 
Bilingual 5556.19 
Neither 242.97


In [82]:
'''
predictions = [prEngOnly, prFrOnly, prBilang, prNeither]
actual = [Y_EngOnly_test, Y_FrOnly_test, Y_Bilang_test, Y_Neither_test]
for p in predictions: 
    print(p)
    '''

'\npredictions = [prEngOnly, prFrOnly, prBilang, prNeither]\nactual = [Y_EngOnly_test, Y_FrOnly_test, Y_Bilang_test, Y_Neither_test]\nfor p in predictions: \n    print(p)\n    '

In [59]:
'''
model = MultiOutputClassifier(LogisticRegression(multi_class='multinomial', solver='lbfgs', random_state=42))
model.fit(X_train,Y_train)
'''

"\nmodel = MultiOutputClassifier(LogisticRegression(multi_class='multinomial', solver='lbfgs', random_state=42))\nmodel.fit(X_train,Y_train)\n"

In [90]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y_Categories, test_size=0.25,random_state=1234)
print(len(X_train),len(Y_train),len(X_test),len(Y_test))

683892 683892 227964 227964


In [91]:
Y_EngOnly_train = Y_train['Knowledge of official languages (5):English only[2]']
Y_EngOnly_test = Y_test['Knowledge of official languages (5):English only[2]']
Y_FrOnly_train = Y_train['Knowledge of official languages (5):French only[3]']
Y_FrOnly_test = Y_test['Knowledge of official languages (5):French only[3]']
Y_Bilang_train = Y_train['Knowledge of official languages (5):English and French[4]']
Y_Bilang_test = Y_test['Knowledge of official languages (5):English and French[4]']
Y_Neither_train = Y_train['Knowledge of official languages (5):Neither English nor French[5]']
Y_Neither_test = Y_test['Knowledge of official languages (5):Neither English nor French[5]']

In [92]:
crossValEngOnly5 = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_EngOnly_train),X_train,Y_EngOnly_train, cv=5, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))
crossValFrOnly5 = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_FrOnly_train),X_train,Y_FrOnly_train, cv=5, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))
crossValBilang5 = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_Bilang_train),X_train,Y_Bilang_train, cv=5, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))
crossValNeither5 = cross_validate(DecisionTreeClassifier(max_depth=8).fit(X_train,Y_Neither_train),X_train,Y_Neither_train, cv=5, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error','matthews_corrcoef'))




In [93]:
print("5 Fold Cross Validation")
print("Metrics for English Only \n\n %0.2f mean and %0.2f std Matthews Correlation Coefficient \n %0.2f mean and %0.2f std Mean Absolute Error \n %0.2f mean and %0.2f std Mean Squared Error" 
      % (crossValEngOnly5['test_matthews_corrcoef'].mean(), crossValEngOnly5['test_matthews_corrcoef'].std(), crossValEngOnly5['test_neg_mean_absolute_error'].mean(),crossValEngOnly5['test_neg_mean_absolute_error'].std(), crossValEngOnly5['test_neg_mean_squared_error'].mean(), crossValEngOnly5['test_neg_mean_squared_error'].std()))

print("\nMatthews Correlation ", crossValEngOnly5['test_matthews_corrcoef'])
print("Mean Absolute Error ", crossValEngOnly5['test_neg_mean_absolute_error'] )
print("Mean Squared Error " ,crossValEngOnly5['test_neg_mean_squared_error'])
print("Fit time ", crossValEngOnly5['fit_time'].mean())
print("Score time ", crossValEngOnly5['score_time'].mean())
print("\n\n")
print("Metrics for French Only \n\n %0.2f mean and %0.2f std Matthews Correlation Coefficient \n %0.2f mean and %0.2f std Mean Absolute Error \n %0.2f mean and %0.2f std Mean Squared Error" 
      % (crossValFrOnly5['test_matthews_corrcoef'].mean(),crossValFrOnly5['test_matthews_corrcoef'].std(), crossValFrOnly5['test_neg_mean_absolute_error'].mean(),crossValFrOnly5['test_neg_mean_absolute_error'].std(), crossValFrOnly5['test_neg_mean_squared_error'].mean(),crossValFrOnly5['test_neg_mean_squared_error'].std()))
print("\nMatthews Correlation ", crossValFrOnly5['test_matthews_corrcoef'])
print("Mean Absolute Error ", crossValFrOnly5['test_neg_mean_absolute_error'] )
print("Mean Squared Error" ,crossValFrOnly5['test_neg_mean_squared_error'])
print("Fit time ", crossValFrOnly5['fit_time'].mean())
print("Score time ", crossValFrOnly5['score_time'].mean())
print("\n\n")
print("Metrics for Both English and French \n\n %0.2f mean and %0.2f std Matthews Correlation Coefficient \n %0.2f mean and %0.2f std Mean Absolute Error \n %0.2f mean and %0.2f std Mean Squared Error" 
      % (crossValBilang5['test_matthews_corrcoef'].mean(),crossValBilang5['test_matthews_corrcoef'].std(), crossValBilang5['test_neg_mean_absolute_error'].mean(),crossValBilang5['test_neg_mean_absolute_error'].std(), crossValBilang5['test_neg_mean_squared_error'].mean(), crossValBilang5['test_neg_mean_squared_error'].std()))
print("\nMatthews Correlation ", crossValBilang5['test_matthews_corrcoef'])
print("Mean Absolute Error ", crossValBilang5['test_neg_mean_absolute_error'] )
print("Mean Squared Error" ,crossValBilang5['test_neg_mean_squared_error'])
print("Fit time ", crossValBilang5['fit_time'].mean())
print("Score time ", crossValBilang5['score_time'].mean())
print("\n\n")
print("Metrics for Neither English nor French \n\n %0.2f mean and %0.2f std Matthews Correlation Coefficient \n %0.2f mean and %0.2f std Mean Absolute Error \n %0.2f mean and %0.2f std Mean Squared Error" 
      % (crossValNeither5['test_matthews_corrcoef'].mean(),crossValNeither5['test_matthews_corrcoef'].std(), crossValNeither5['test_neg_mean_absolute_error'].mean(),crossValNeither5['test_neg_mean_absolute_error'].std(), crossValNeither5['test_neg_mean_squared_error'].mean(),crossValNeither5['test_neg_mean_squared_error'].std()))
print("\nMatthews Correlation ", crossValNeither5['test_matthews_corrcoef'])
print("Mean Absolute Error ", crossValNeither5['test_neg_mean_absolute_error'] )
print("Mean Squared Error" ,crossValNeither5['test_neg_mean_squared_error'])
print("Fit time ", crossValNeither5['fit_time'].mean())
print("Score time ", crossValNeither5['score_time'].mean())

5 Fold Cross Validation
Metrics for English Only 

 0.13 mean and 0.00 std Matthews Correlation Coefficient 
 -19.63 mean and 3.19 std Mean Absolute Error 
 -645895.83 mean and 245332.03 std Mean Squared Error

Matthews Correlation  [0.13736868 0.1310792  0.13257515 0.13252542 0.12314824]
Mean Absolute Error  [-17.52604566 -17.94193553 -19.5789162  -25.79705801 -17.28947638]
Mean Squared Error  [ -523559.24831297  -445696.99551832  -603782.00459869 -1126243.18128646
  -530197.73538142]
Fit time  1.6069054126739502
Score time  0.46847009658813477



Metrics for French Only 

 0.24 mean and 0.01 std Matthews Correlation Coefficient 
 -2.86 mean and 0.55 std Mean Absolute Error 
 -93278.75 mean and 34798.58 std Mean Squared Error

Matthews Correlation  [0.26009135 0.2480817  0.23948608 0.22579342 0.23692184]
Mean Absolute Error  [-2.5287142  -3.72586435 -2.77135212 -2.12450102 -3.1435611 ]
Mean Squared Error [ -64462.6592898  -157896.18837687  -94373.62039217  -61283.29318311
  -88377.967

The Matthews Correlation Coefficient (MCC) is stable under a different training set. The MCC ranges over [-1, 1], and the observed values of 0.09 to 0.25 all indicate weak relationships. The relationship is noticeably stronger for the French Only and Bilingual outputs, but it is still only weakly positive. 
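
To make the [-1, 1] scale concrete, here is a minimal sketch of the MCC for the simpler binary case, computed directly from confusion-matrix counts. The labels are made-up toy data, not values from the census dataset, and `binary_mcc` is a hypothetical helper; the notebook itself relies on scikit-learn's `matthews_corrcoef` scorer, which also handles more than two classes.

```python
from math import sqrt


def binary_mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is taken as 0 when the denominator vanishes.
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


# Toy labels (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
perfect = list(y_true)                  # every prediction correct
inverted = [1 - label for label in y_true]  # every prediction wrong
weak = [1, 1, 0, 0, 1, 0, 0, 0]         # 5 of 8 correct

print("perfect :", binary_mcc(y_true, perfect))
print("inverted:", binary_mcc(y_true, inverted))
print("weak    :", round(binary_mcc(y_true, weak), 3))
```

The third case gets five of eight labels right yet scores only about 0.26, which is in the same region as the French Only and Bilingual results above.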

Both the Mean Absolute Error (MAE) and the Mean Squared Error (MSE) are reported as negated scores in the cross-validation tests above, so they have the range (-∞, 0], with values closer to 0 indicating better prediction because the errors are smaller. Neither metric takes the direction of the errors into account: the MAE averages the absolute size of the errors, while the MSE averages their squares and therefore gives high weight to large errors. In the metrics given above, the MAE is relatively small while the MSE is large. This indicates that the errors vary greatly in size, with a small number of very large errors dominating the MSE. 
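
The contrast between the two metrics can be illustrated with a small sketch on made-up numbers (not values from the dataset). The helpers `mae` and `mse` are hypothetical stand-ins for scikit-learn's `mean_absolute_error` and `mean_squared_error`. Two sets of predictions are built to have the same MAE, but one concentrates its error in a single record.

```python
def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy counts (illustrative only).
actual = [10, 12, 9, 11, 10]
even_pred = [12, 10, 11, 9, 12]   # every prediction is off by 2
spiky_pred = [10, 12, 9, 11, 20]  # four exact predictions, one off by 10

print("even :", mae(actual, even_pred), mse(actual, even_pred))
print("spiky:", mae(actual, spiky_pred), mse(actual, spiky_pred))

# Flipping the sign of every error leaves both metrics unchanged,
# because neither one records the direction of an error.
mirrored_pred = [2 * a - p for a, p in zip(actual, even_pred)]
print("mirrored:", mae(actual, mirrored_pred), mse(actual, mirrored_pred))
```

Both prediction sets have an MAE of 2, but the MSE rises from 4 to 20 when the error is concentrated in one record. This is the pattern suggested by the small MAE and very large MSE in the results above.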

