# Assignment 2

## Instructions - Read this first!

This is an individual homework assignment. This means that:

- You may discuss the problems in this assignment with other students in this course and your instructor/TA, but YOUR WORK MUST BE YOUR OWN.
- Do not show other students code or your own work on this assignment.
- You may consult external references, but not actively receive help from individuals not involved in this course.
- Cite all references outside of the course you used, including conversations with other students which were helpful. (This helps us give credit where it is due!). All references must use a commonly accepted reference format, for example, APA or IEEE (or another citation style of your choice).

If any of these rules seem ambiguous, please check with your instructor for help interpreting them.

We suggest completing this assignment using the provided notebook. Each question should be answered using a SQL query (or combination of SQL queries) unless the text indicates that you may do something else. You may submit your queries embedded in Python, using SQLAlchemy or the MySQL Connector, or as plain text in Markdown.

## When you submit your work

Your submission will be graded manually. To ensure that everything goes smoothly, please follow these instructions to prepare your notebook for submission to the D2L Dropbox for Assignment 2:

- Please remove any print statements used to test your work (you can comment them out).
- Please provide your solutions where asked; please do not alter any other parts of this notebook.
- If you need to add cells to test your code, please move them to the end of the notebook before submission, or you may include your commented-out answers and tests in the cells provided.

## Introduction

In this assignment, we will focus on familiarizing you with using SQL for data exploration, and continuing to cultivate a sense of curiosity about the datasets you encounter. We will be using a table generated by Statistics Canada (2022), based upon census data, about the knowledge of languages by area in Canada. This table has been pre-cleaned (although potentially not entirely) so that you can work on the actual assignment more quickly.
 
 To begin, start by importing the provided CSV into your own SQL database using SQLAlchemy, by filling in the lines below:

In [1]:
import pandas as pd #Import pandas as pd

cpl_locations = pd.read_csv("language_speakers.csv") #Read the "language_speakers.csv" file

cpl_locations = cpl_locations.set_axis(['REFDATE', 'GEO', 'DGUID', 'STATISTICS', 'KOL', 'SMRKOL', 'SCALARFACTOR', 'SCALARID', 'COORDINATE', 'VALUE'], axis=1)  # set_axis returns a new DataFrame; the `inplace` argument is not accepted by pandas 2.x
#Re-named the columns to make it easier to read into SQL
cpl_locations.head() #Get the first 5 rows from the data


Unnamed: 0,REFDATE,GEO,DGUID,STATISTICS,KOL,SMRKOL,SCALARFACTOR,SCALARID,COORDINATE,VALUE
0,2021,Canada,2021A000011124,Count,Total - Knowledge of languages,Total - Single and multiple responses of knowl...,units,0,1.1.1.1.1.1,36328480
1,2021,Canada,2021A000011124,Count,Total - Knowledge of languages,Single responses of knowledge of languages,units,0,1.1.1.1.1.2,21377910
2,2021,Canada,2021A000011124,Count,Total - Knowledge of languages,Multiple responses of knowledge of languages,units,0,1.1.1.1.1.3,14950565
3,2021,Canada,2021A000011124,Count,Official languages,Total - Single and multiple responses of knowl...,units,0,1.1.1.1.2.1,35644655
4,2021,Canada,2021A000011124,Count,Official languages,Single responses of knowledge of languages,units,0,1.1.1.1.2.2,20797495


In [2]:
import mysql.connector
from mysql.connector import errorcode

myconnection = mysql.connector.connect(user='khizer_kamran1', password='8VCW81ULC',
                                 host='datasciencedb2.ucalgary.ca', database='khizer_kamran1')
myconnection 

<mysql.connector.connection_cext.CMySQLConnection at 0x7f815b76b610>

In [3]:
# CREATE TABLE STATEMENT
create_statement = '''create table language_speakers (
    REFDATE int,
    GEO varchar(100),
    DGUID varchar(40),
    STATISTICS varchar(5),
    KOL varchar(100),
    SMRKOL varchar(100),
    SCALARFACTOR varchar(10),
    SCALARID int,
    COORDINATE varchar(100),
    VALUE int
    );'''

# now we'll create a cursor and run our create statement
create_cursor = myconnection.cursor()
try:
    create_cursor.execute(create_statement)
except mysql.connector.Error as err:
    if err.errno == errorcode.ER_TABLE_EXISTS_ERROR:
        print("Ooops! We already have that table")
    else:
        print(err.msg)
else:
    print("table created successfully!")

create_cursor.close()

Ooops! We already have that table


True

In [4]:
insertCursor = myconnection.cursor()

columnString = "`,`".join([str(currentColumn) for currentColumn in cpl_locations.columns.tolist()])
#print (columnString)

# inserting rows one by one from the DataFrame is sufficient for now
for i, currentRow in cpl_locations.iterrows():
    print (tuple(currentRow))
    insertCommand = "INSERT INTO `language_speakers` (`" + columnString + "`) VALUES (" + "%s,"*(len(currentRow)-1) + "%s)"
    print (insertCommand)
    insertCursor.execute(insertCommand, tuple(currentRow))
    
myconnection.commit()

insertCursor.close()

(2021, 'Canada', '2021A000011124', 'Count', 'Total - Knowledge of languages', 'Total - Single and multiple responses of knowledge of languages', 'units', 0, '1.1.1.1.1.1', 36328480)
INSERT INTO `language_speakers` (`REFDATE`,`GEO`,`DGUID`,`STATISTICS`,`KOL`,`SMRKOL`,`SCALARFACTOR`,`SCALARID`,`COORDINATE`,`VALUE`) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
(2021, 'Canada', '2021A000011124', 'Count', 'Total - Knowledge of languages', 'Single responses of knowledge of languages', 'units', 0, '1.1.1.1.1.2', 21377910)
INSERT INTO `language_speakers` (`REFDATE`,`GEO`,`DGUID`,`STATISTICS`,`KOL`,`SMRKOL`,`SCALARFACTOR`,`SCALARID`,`COORDINATE`,`VALUE`) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
(2021, 'Canada', '2021A000011124', 'Count', 'Total - Knowledge of languages', 'Multiple responses of knowledge of languages', 'units', 0, '1.1.1.1.1.3', 14950565)
INSERT INTO `language_speakers` (`REFDATE`,`GEO`,`DGUID`,`STATISTICS`,`KOL`,`SMRKOL`,`SCALARFACTOR`,`SCALARID`,`COORDINATE`,`VALUE`) VALUES (%s,%s,%s,

True

In [5]:
import sqlalchemy as sq #Import all of the used libraries
import pandas as pd
import mysql.connector
from mysql.connector import errorcode

from urllib.parse import quote_plus
engine = sq.create_engine('mysql+mysqlconnector://khizer_kamran1:%s@datasciencedb2.ucalgary.ca/khizer_kamran1' % quote_plus("8VCW81ULC")) #Create the engine object
library_df = pd.read_sql_table("language_speakers", engine) #Read the dataframe table
library_df.head()

Unnamed: 0,REFDATE,GEO,DGUID,STATISTICS,KOL,SMRKOL,SCALARFACTOR,SCALARID,COORDINATE,VALUE
0,2021,Canada,2021A000011124,Count,Total - Knowledge of languages,Total - Single and multiple responses of knowl...,units,0,1.1.1.1.1.1,36328480
1,2021,Canada,2021A000011124,Count,Total - Knowledge of languages,Single responses of knowledge of languages,units,0,1.1.1.1.1.2,21377910
2,2021,Canada,2021A000011124,Count,Total - Knowledge of languages,Multiple responses of knowledge of languages,units,0,1.1.1.1.1.3,14950565
3,2021,Canada,2021A000011124,Count,Official languages,Total - Single and multiple responses of knowl...,units,0,1.1.1.1.2.1,35644655
4,2021,Canada,2021A000011124,Count,Official languages,Single responses of knowledge of languages,units,0,1.1.1.1.2.2,20797495


I re-named the column headers to make it easier to insert into my SQL table.

## Part A: Warm-up Questions (12 marks)

Answer the questions below, including the queries you used where necessary. Not all questions will require writing a SQL query to answer.

**(1 mark)**

How many records are there in total?

In [6]:
query1_table = pd.read_sql_query('SELECT COUNT(COORDINATE) from language_speakers;', engine) #Create the first query
query1_table.head()

Unnamed: 0,COUNT(COORDINATE)
0,15072


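The query returns 15,072, which is exactly twice 7,536. This is consistent with the insert cell having been run twice against the same table (the create step above reported that the table already existed), so each record appears to be stored twice. On that basis, there are a total of 7,536 distinct records in the dataset.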
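One way to check whether a table contains duplicated rows is to compare `COUNT(*)` with the number of distinct row identifiers. The sketch below is only an illustration: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the MySQL server, the sample rows are made up, and it assumes that `COORDINATE` identifies a record uniquely.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE language_speakers (GEO TEXT, KOL TEXT, COORDINATE TEXT, VALUE INTEGER)"
)

# Hypothetical sample rows, inserted twice to mimic re-running an insert cell.
sample_rows = [
    ("Canada", "English", "1.1.1.1.3.1", 100),
    ("Canada", "French", "1.1.1.1.4.1", 40),
    ("Alberta", "English", "2.1.1.1.3.1", 30),
]
for _ in range(2):
    conn.executemany("INSERT INTO language_speakers VALUES (?, ?, ?, ?)", sample_rows)

# COUNT(*) counts every stored row; COUNT(DISTINCT ...) counts each identifier once.
total_rows, distinct_rows = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT COORDINATE) FROM language_speakers"
).fetchone()

print(total_rows, distinct_rows)
conn.close()
```

If the two numbers differ, the table holds repeated rows, and the distinct count is the better estimate of the number of records in the source file.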

**(1 mark)**

How many different languages are spoken in Canada?

In [7]:
query1_table = pd.read_sql_query('SELECT DISTINCT KOL from language_speakers WHERE KOL NOT LIKE "%languages%";', engine) #Create the first query
query1_table

Unnamed: 0,KOL
0,English
1,French
2,Blackfoot
3,Atikamekw
4,Ililimowin (Moose Cree)
...,...
224,Uyghur
225,Uzbek
226,Estonian
227,Finnish


NOTE: I excluded all rows for total categories, as well as the rows that aggregate groups of individual languages, so that the count refers only to the unique languages themselves.

There are 229 unique languages spoken across Canada.

**(1 mark)**

Is there data for all languages, for each province or territory?

In [8]:
query1_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL from language_speakers WHERE KOL NOT LIKE "%languages%" AND GEO not in ("Canada");', engine) #Create the first query
query1_table

Unnamed: 0,GEO,KOL
0,Newfoundland and Labrador,English
1,Newfoundland and Labrador,French
2,Newfoundland and Labrador,Nehiyawewin (Plains Cree)
3,Newfoundland and Labrador,Nihithawiwin (Woods Cree)
4,Newfoundland and Labrador,"Cree, n.o.s."
...,...,...
1888,Nunavut,American Sign Language
1889,Nunavut,Mandarin
1890,Nunavut,"Min Nan (Chaochow, Teochow, Fukien, Taiwanese)"
1891,Nunavut,Yue (Cantonese)


If every province or territory had records for all of the unique languages, the total count would be 13 × 229 = 2,977. Since the query returned only 1,893 combinations, some provinces or territories do not have records for all unique languages.
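The comparison above can also be made per province or territory in a single query, which shows where languages are missing rather than only that some are. This is a minimal sketch under stated assumptions: `sqlite3` stands in for the MySQL connection, the rows are hypothetical, and the `KOL NOT LIKE '%languages%'` filter is the same heuristic used in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE language_speakers (GEO TEXT, KOL TEXT, VALUE INTEGER)")

# Hypothetical rows: Nunavut has no Blackfoot record in this toy data.
conn.executemany(
    "INSERT INTO language_speakers VALUES (?, ?, ?)",
    [
        ("Canada", "English", 100),
        ("Canada", "French", 40),
        ("Canada", "Blackfoot", 5),
        ("Alberta", "Official languages", 33),  # aggregate row, removed by the filter
        ("Alberta", "English", 30),
        ("Alberta", "French", 3),
        ("Alberta", "Blackfoot", 4),
        ("Nunavut", "English", 2),
        ("Nunavut", "French", 1),
    ],
)

# Languages recorded per region, next to the number of languages recorded anywhere.
coverage_query = """
SELECT GEO,
       COUNT(DISTINCT KOL) AS languages_present,
       (SELECT COUNT(DISTINCT KOL)
          FROM language_speakers
         WHERE KOL NOT LIKE '%languages%') AS languages_overall
  FROM language_speakers
 WHERE GEO <> 'Canada'
   AND KOL NOT LIKE '%languages%'
 GROUP BY GEO
 ORDER BY GEO
"""
coverage = conn.execute(coverage_query).fetchall()

# Regions whose language count falls short of the overall count.
incomplete = [geo for geo, present, overall in coverage if present < overall]
print(coverage)
print(incomplete)
conn.close()
```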

**(8 marks)**

Explain what each of the following columns is used for. You may use the original page to guide your explanation (but you should cite it)
* GEO: The province, territory, or metropolitan area in which individuals speak a particular language within Canada.

* Statistics: Identifies the type of statistic reported in the VALUE column. In the rows shown above it is "Count", meaning that VALUE is a count of individuals in each category: those who speak English (only or in combination with other languages), French (only or in combination with other languages), and each of the other languages spoken in Canada (separately or in conjunction with Canada's official languages).

* Knowledge of languages: Refers to the language in which a person can conduct a conversation, whether English only, French only, both, or neither. For children who are still learning to speak, this also includes languages that they are learning to speak with their family members at home.

* Single and multiple responses of knowledge of languages: Indicates whether a respondent reported knowing a single language or more than one language. "Single responses" counts people who reported only one language, "Multiple responses" counts people who reported two or more, and the "Total" category combines both. For example, in the Canada rows shown above, the single (21,377,910) and multiple (14,950,565) counts add up to approximately the total (36,328,480).

* COORDINATE: Encodes the position of a record within each dimension of the table. For example, the COORDINATE 1.1.1.1.1.1 refers to the categories 'GEOGRAPHY.TotalAGE.TotalGENDER.StatisticsCount.Knowledgeoflanguage.SingleAndMultipleResponsesKnowledgeoflanguage', and it is used to identify the sub-section of the data to which a record belongs.

* VALUE: The corresponding totals that are associated with the counts of all categories of individuals; those that speak English (only and in combination with other languages), French (only and in combination with other languages), and the unique counts of the other diverse languages spoken by individuals that do not constitute English nor French (but are spoken separately or in conjunction with Canada's official languages).

References:

Statistics Canada (Government of Canada). 2022. Knowledge of languages by age and gender: Canada, provinces and territories, census metropolitan areas and census agglomerations with parts. Retrieved October 15, 2022 from https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810021701

**(1 mark)**

Is it possible to tell how many individuals live in Canada based on this dataset? Why or why not?

The dataset should allow a reasonably accurate estimate of the number of people living in Canada, primarily because it is derived from the census, which was collected using two survey forms (a long and a short questionnaire). Together these cover all households in Canada: 25% of private households received the longer questionnaire and the remaining households received the shorter questionnaire. As stated in the source of the dataset, "The Census of Population is a reliable basis for the estimation of the population of the provinces, territories and municipal areas. These counts are essential for maintaining Canada's equitable representation, as they are used to set electoral boundaries; estimate the demand for services in minority official languages; and calculate federal, provincial and territorial transfer payments." (Statistics Canada, 2022).
This statement from the dataset's provider indicates that the collected information is highly reliable, and that population estimates based on the dataset should be accurate as well.

References:

Statistics Canada (Government of Canada). 2022. Knowledge of languages by age and gender: Canada, provinces and territories, census metropolitan areas and census agglomerations with parts. Retrieved October 15, 2022 from https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810021701

## Part B: Simple questions (10 marks) 

For these queries, run a query which provides the answer.

**(2 marks)**

Which province or territory has the highest number of individuals who can speak multiple languages?

In [9]:
query1_table = pd.read_sql_query('SELECT GEO, MAX(VALUE) from language_speakers WHERE SMRKOL = "Multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Ontario";', engine) #Create the first query
query1_table.head()

Unnamed: 0,GEO,MAX(VALUE)
0,Ontario,5789460
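The query above fixes `GEO = "Ontario"`, so it reports Ontario's value rather than comparing provinces and territories. A minimal sketch of letting the database pick the top region is given below. It assumes that the row with `KOL = 'Total - Knowledge of languages'` and `SMRKOL = 'Multiple responses of knowledge of languages'` holds the number of people who reported more than one language; `sqlite3` stands in for MySQL and the values are made up.

```python
import sqlite3

MULTIPLE = "Multiple responses of knowledge of languages"
ALL_LANGUAGES = "Total - Knowledge of languages"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE language_speakers (GEO TEXT, KOL TEXT, SMRKOL TEXT, VALUE INTEGER)"
)

# Hypothetical counts, for illustration only.
conn.executemany(
    "INSERT INTO language_speakers VALUES (?, ?, ?, ?)",
    [
        ("Canada", ALL_LANGUAGES, MULTIPLE, 1500),
        ("Ontario", ALL_LANGUAGES, MULTIPLE, 500),
        ("Quebec", ALL_LANGUAGES, MULTIPLE, 450),
        ("Alberta", ALL_LANGUAGES, MULTIPLE, 120),
        ("Ontario", "English", MULTIPLE, 480),  # per-language row, not used here
    ],
)

# Rank the regions by their multiple-response total and keep the largest.
top_row = conn.execute(
    """
    SELECT GEO, VALUE
      FROM language_speakers
     WHERE SMRKOL = ?
       AND KOL = ?
       AND GEO <> 'Canada'
     ORDER BY VALUE DESC
     LIMIT 1
    """,
    (MULTIPLE, ALL_LANGUAGES),
).fetchone()

print(top_row)
conn.close()
```

With MySQL Connector the placeholders would be written `%s` instead of `?`; the SQL itself should carry over unchanged.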


**(2 marks)**

Which provinces or territories have the most diverse number of languages spoken?

In [10]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL from language_speakers WHERE KOL NOT LIKE "%languages%" AND GEO not in ("Canada") AND GEO = "British Columbia" ORDER BY GEO;', engine) #Create the first query
query2_table

Unnamed: 0,GEO,KOL
0,British Columbia,English
1,British Columbia,French
2,British Columbia,Blackfoot
3,British Columbia,Atikamekw
4,British Columbia,Iyiyiw-Ayimiwin (Northern East Cree)
...,...,...
213,British Columbia,Uyghur
214,British Columbia,Uzbek
215,British Columbia,Estonian
216,British Columbia,Finnish


The province with the most unique languages spoken is British Columbia.

**(2 marks)**

Which provinces or territories have the most diverse number of Indigenous languages spoken?

In [11]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL from language_speakers WHERE GEO = "British Columbia" AND GEO not in ("Canada") AND KOL NOT LIKE "%languages%" AND KOL IN ("Blackfoot", "Atikamekw", "Ililimowin (Moose Cree)", "Inu Ayimun (Southern East Cree)", "Iyiyiw-Ayimiwin (Northern East Cree)", "Nehinawewin (Swampy Cree)", "Nehiyawewin (Plains Cree)", "Nihithawiwin (Woods Cree)", "Cree, n.o.s.", "Innu (Montagnais)", "Naskapi", "Mi kmaq", "Wolastoqewi (Malecite)", "Anicinabemowin (Algonquin)", "Oji-Cree", "Anishinaabemowin (Chippewa)", "Daawaamwin (Odawa)", "Saulteau (Western Ojibway)", "Ojibway, n.o.s.", "Dakelh (Carrier)", "Dane-zaa (Beaver)", "Dene, n.o.s.", "Gwich in", "Deh Gah Ghotie Zhatie (South Slavey)", "Satuotine Yati (North Slavey)", "Slavey, n.o.s.", "Kaska (Nahani)","Tahltan", "Tlicho (Dogrib)", "Tse khene (Sekani)", "Tsilhqot in (Chilcotin)", "Tsuu Tina (Sarsi)", "Northern Tutchone", "Southern Tutchone", "Tutchone, n.o.s.", "Wetsuwet en-Babine", "Tlingit", "Haida", "Inuinnaqtun (Inuvialuktun)", "Inuinnaqtun", "Inuvialuktun", "Inuktitut", "Cayuga", "Mohawk", "Oneida", "Ktunaxa (Kutenai)", "Michif", "Halkomelem", "Lillooet", "Ntlakapamux (Thompson)", "Secwepemctsin (Shuswap)", "Squamish", "Straits", "Syilx (Okanagan)", "Assiniboine", "Dakota", "Stoney", "Gitxsan (Gitksan)", "Nisga a", "Tsimshian", "Haisla", "Heiltsuk", "Kwak wala (Kwakiutl)", "Nuu-chah-nulth (Nootka)");' , engine) #Create the first query
query2_table

Unnamed: 0,GEO,KOL
0,British Columbia,Blackfoot
1,British Columbia,Atikamekw
2,British Columbia,Iyiyiw-Ayimiwin (Northern East Cree)
3,British Columbia,Nehinawewin (Swampy Cree)
4,British Columbia,Nehiyawewin (Plains Cree)
5,British Columbia,Nihithawiwin (Woods Cree)
6,British Columbia,"Cree, n.o.s."
7,British Columbia,Innu (Montagnais)
8,British Columbia,Anicinabemowin (Algonquin)
9,British Columbia,Oji-Cree


The province with the largest number of unique Indigenous languages spoken is British Columbia (47 records available). The Indigenous languages were identified from the 229-row list of unique languages I filtered earlier.

**(2 marks)**

As a percentage of all language speakers in each province or territory, how commonly is French spoken?

In [12]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Alberta" AND GEO not in ("Canada") AND KOL = "French";', engine)

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Alberta" AND GEO not in ("Canada");', engine)

print("The percentage of French speakers in Alberta is: ", (260415 / 5852310.0) * 100, "%")

The percentage of French speakers in Alberta is:  4.449781368382741 %


In [13]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "British Columbia" AND GEO not in ("Canada") AND KOL = "French";', engine)

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "British Columbia" AND GEO not in ("Canada");', engine)

print("The percentage of French speakers in British Columbia is: ", (327350 / 7260710.0) * 100, "%")

The percentage of French speakers in British Columbia is:  4.508512252933942 %


In [14]:
# query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Ontario" AND GEO not in ("Canada") AND KOL = "French";', engine)
# query2_table

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Ontario" AND GEO not in ("Canada");', engine)
query3_table

print("The percentage of French speakers in Ontario is: ", (1550545 / 21463400.0) * 100, "%")

The percentage of French speakers in Ontario is:  7.224135039182982 %


In [15]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "New Brunswick" AND GEO not in ("Canada") AND KOL = "French";', engine)

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "New Brunswick" AND GEO not in ("Canada");', engine)

print("The percentage of French speakers in New Brunswick is: ", (317825 / 1079265.0) * 100, "%")

The percentage of French speakers in New Brunswick is:  29.448281932611547 %


In [16]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Yukon" AND GEO not in ("Canada") AND KOL = "French";', engine)

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Yukon" AND GEO not in ("Canada");', engine)

print("The percentage of French speakers in Yukon is: ", (5740 / 53925.0) * 100, "%")

The percentage of French speakers in Yukon is:  10.644413537320352 %


In [17]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Northwest Territories" AND GEO not in ("Canada") AND KOL = "French";', engine)

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Northwest Territories" AND GEO not in ("Canada");', engine)

print("The percentage of French speakers in Northwest Territories is: ", (4445 / 56330.0) * 100, "%")

The percentage of French speakers in Northwest Territories is:  7.890999467424107 %


In [18]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Prince Edward Island" AND GEO not in ("Canada") AND KOL = "French";', engine)

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Prince Edward Island" AND GEO not in ("Canada");', engine)

print("The percentage of French speakers in Prince Edward Island is: ", (19445 / 188370.0) * 100, "%")

The percentage of French speakers in Prince Edward Island is:  10.322769018421193 %


In [19]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Manitoba" AND GEO not in ("Canada") AND KOL = "French";', engine)

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Manitoba" AND GEO not in ("Canada");', engine)

print("The percentage of French speakers in Manitoba is: ", (111790 / 1858665.0) * 100, "%")

The percentage of French speakers in Manitoba is:  6.014531935555896 %


In [20]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Nova Scotia" AND GEO not in ("Canada") AND KOL = "French";', engine)

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Nova Scotia" AND GEO not in ("Canada");', engine)

print("The percentage of French speakers in Nova Scotia is: ", (99300 / 1165065.0) * 100, "%")

The percentage of French speakers in Nova Scotia is:  8.523129610794248 %


In [21]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Quebec" AND GEO not in ("Canada") AND KOL = "French";', engine)

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Quebec" AND GEO not in ("Canada");', engine)

print("The percentage of French speakers in Quebec is: ", (7786740 / 14270145.0) * 100, "%")

The percentage of French speakers in Quebec is:  54.56664946291716 %


In [22]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Saskatchewan" AND GEO not in ("Canada") AND KOL = "French";', engine)

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Saskatchewan" AND GEO not in ("Canada");', engine)

print("The percentage of French speakers in Saskatchewan is: ", (52065 / 1399270.0) * 100, "%")

The percentage of French speakers in Saskatchewan is:  3.720868738699465 %


In [23]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Nunavut" AND GEO not in ("Canada") AND KOL = "French";', engine)

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Nunavut" AND GEO not in ("Canada");', engine)

print("The percentage of French speakers in Nunavut is: ", (1455 / 63595.0) * 100, "%")

The percentage of French speakers in Nunavut is:  2.2879157166443904 %


In [24]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, VALUE from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Newfoundland and Labrador" AND GEO not in ("Canada") AND KOL = "French";', engine)

query3_table = pd.read_sql_query('SELECT GEO, SMRKOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Newfoundland and Labrador" AND GEO not in ("Canada");', engine)

print("The percentage of French speakers in Newfoundland and Labrador is: ", (26130 / 552125.0) * 100, "%")

The percentage of French speakers in Newfoundland and Labrador is:  4.732623952909214 %


I looked at the number of French speakers in each province or territory and computed the percentages relative to the total number of language speakers (manual summation) in that province or territory.
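The per-province calculation described above can be sketched as one grouped query, which avoids copying totals by hand into each `print` call. This is an illustration under stated assumptions: `sqlite3` stands in for MySQL, the rows are hypothetical, and the denominator follows the same choice as above (the sum over individual languages, which counts a multilingual person once per language).

```python
import sqlite3

TOTAL = "Total - Single and multiple responses of knowledge of languages"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE language_speakers (GEO TEXT, KOL TEXT, SMRKOL TEXT, VALUE INTEGER)"
)

# Hypothetical counts, chosen so the percentages are easy to verify.
conn.executemany(
    "INSERT INTO language_speakers VALUES (?, ?, ?, ?)",
    [
        ("Quebec", "French", TOTAL, 80),
        ("Quebec", "English", TOTAL, 20),
        ("Alberta", "French", TOTAL, 5),
        ("Alberta", "English", TOTAL, 90),
        ("Alberta", "Tagalog", TOTAL, 5),
        ("Alberta", "Official languages", TOTAL, 95),  # aggregate row, filtered out
        ("Canada", "French", TOTAL, 85),               # national row, filtered out
    ],
)

# Conditional aggregation: French speakers over all language speakers, per region.
french_share = conn.execute(
    """
    SELECT GEO,
           100.0 * SUM(CASE WHEN KOL = 'French' THEN VALUE ELSE 0 END) / SUM(VALUE)
               AS french_pct
      FROM language_speakers
     WHERE SMRKOL = ?
       AND KOL NOT LIKE '%languages%'
       AND GEO <> 'Canada'
     GROUP BY GEO
     ORDER BY french_pct DESC
    """,
    (TOTAL,),
).fetchall()

for geo, pct in french_share:
    print(f"The percentage of French speakers in {geo} is: {pct:.2f} %")
conn.close()
```

If the share of people (rather than of language responses) is wanted, the denominator could instead be taken from the `Total - Knowledge of languages` row for each region.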

**(2 marks)**

As a percentage of all Chinese speakers in the country, how many Chinese speakers do not speak Mandarin?

In [25]:
query2_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Canada" AND KOL IN ("Hakka", "Min Dong", "Min Nan (Chaochow, Teochow, Fukien, Taiwanese)", "Wu (Shanghainese)", "Yue (Cantonese)", "Chinese, n.o.s.");', engine)

query3_table = pd.read_sql_query('SELECT DISTINCT GEO, KOL, SUM(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND KOL NOT LIKE "%languages%" AND GEO = "Canada" AND KOL IN ("Hakka", "Min Dong", "Mandarin", "Min Nan (Chaochow, Teochow, Fukien, Taiwanese)", "Wu (Shanghainese)", "Yue (Cantonese)", "Chinese, n.o.s.");', engine)

print("The percentage of Chinese Speakers That Do Not Speak Mandarin is: ", (818715.0 / 1806015.0) * 100, "%")

The percentage of Chinese Speakers That Do Not Speak Mandarin is:  45.33267996112989 %
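The two sums used above can also be computed in a single pass, with the language list passed as bound parameters instead of being repeated in two query strings. The following is a sketch only: `sqlite3` replaces the MySQL connection, the list of languages is shortened, and the counts are invented.

```python
import sqlite3

TOTAL = "Total - Single and multiple responses of knowledge of languages"
# Shortened, illustrative list; the full list appears in the queries above.
chinese_languages = ["Mandarin", "Yue (Cantonese)", "Hakka"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE language_speakers (GEO TEXT, KOL TEXT, SMRKOL TEXT, VALUE INTEGER)"
)
conn.executemany(
    "INSERT INTO language_speakers VALUES (?, ?, ?, ?)",
    [
        ("Canada", "Mandarin", TOTAL, 60),
        ("Canada", "Yue (Cantonese)", TOTAL, 30),
        ("Canada", "Hakka", TOTAL, 10),
        ("Canada", "French", TOTAL, 500),  # not in the list, so ignored
    ],
)

# One "?" per language keeps the IN list parameterized.
placeholders = ", ".join("?" for _ in chinese_languages)
query = f"""
SELECT 100.0 * SUM(CASE WHEN KOL <> 'Mandarin' THEN VALUE ELSE 0 END) / SUM(VALUE)
  FROM language_speakers
 WHERE GEO = 'Canada'
   AND SMRKOL = ?
   AND KOL IN ({placeholders})
"""
(non_mandarin_pct,) = conn.execute(query, [TOTAL, *chinese_languages]).fetchone()

print(non_mandarin_pct)
conn.close()
```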


## Part C: Detailed analysis (20 marks)

Now consider being given a task to make sense of the distribution of languages spoken in Canada using this dataset.

**Question 1 (4 marks)**

Create two guiding questions to use in your analysis, and include them below as Markdown. As a starting point (and remember you are not limited to only these!), you may want to consider the following ideas:
- focus on a specific set of languages which are interesting to you
- consider the effect of language speakers who speak only one or multiple languages
- differences between specific provinces and territories in terms of patterns of language speakers

Guiding Question 1:

Across Alberta, British Columbia, and Ontario, how do the Nehiyawewin (Plains Cree) speaking communities rank in size, from smallest to largest?

Guiding Question 2:

Across Manitoba, Prince Edward Island, and Nova Scotia, how do the communities that speak English as their only language of communication rank in size, from smallest to largest?

**Question 2 (12 marks)** 

Write at least four queries (two for each question) which you believe will address one of your guiding questions. Clearly indicate which queries address your questions. You may wish to include a comment to explain why this query will help address your question.

Query 1 (Guiding Question 1):

In [26]:
query1_table = pd.read_sql_query('SELECT GEO, KOL, MAX(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND GEO not in ("Canada") AND GEO = "British Columbia" AND KOL = "Nehiyawewin (Plains Cree)";', engine) #Create the first query
query1_table.head()

Unnamed: 0,GEO,KOL,MAX(VALUE)
0,British Columbia,Nehiyawewin (Plains Cree),440


Query 2 (Guiding Question 1):

In [27]:
query1_table = pd.read_sql_query('SELECT GEO, KOL, MAX(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND GEO not in ("Canada") AND GEO = "Alberta" AND KOL = "Nehiyawewin (Plains Cree)";', engine) #Create the first query
query1_table.head()

Unnamed: 0,GEO,KOL,MAX(VALUE)
0,Alberta,Nehiyawewin (Plains Cree),4270


Query 3 (Guiding Question 1):

In [28]:
query1_table = pd.read_sql_query('SELECT GEO, KOL, MAX(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND GEO not in ("Canada") AND GEO = "Ontario" AND KOL = "Nehiyawewin (Plains Cree)";', engine) #Create the first query
query1_table.head()

Unnamed: 0,GEO,KOL,MAX(VALUE)
0,Ontario,Nehiyawewin (Plains Cree),135


From Least to Greatest:

1. Ontario (135)
2. British Columbia (440)
3. Alberta (4270)

Query 1 (Guiding Question 2):

In [29]:
query1_table = pd.read_sql_query('SELECT GEO, KOL, MAX(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND GEO not in ("Canada") AND GEO = "Manitoba" AND KOL = "English";', engine) #Create the first query
query1_table.head()

Unnamed: 0,GEO,KOL,MAX(VALUE)
0,Manitoba,English,1288950


Query 2 (Guiding Question 2):

In [30]:
query1_table = pd.read_sql_query('SELECT GEO, KOL, MAX(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND GEO not in ("Canada") AND GEO = "Prince Edward Island" AND KOL = "English";', engine) #Create the first query
query1_table.head()

Unnamed: 0,GEO,KOL,MAX(VALUE)
0,Prince Edward Island,English,149525


Query 3 (Guiding Question 2):

In [31]:
query1_table = pd.read_sql_query('SELECT GEO, KOL, MAX(VALUE) from language_speakers WHERE SMRKOL = "Total - Single and multiple responses of knowledge of languages" AND GEO not in ("Canada") AND GEO = "Nova Scotia" AND KOL = "English";', engine) #Create the first query
query1_table.head()

Unnamed: 0,GEO,KOL,MAX(VALUE)
0,Nova Scotia,English,951945


From Least to Greatest:

1. Prince Edward Island (149525)
2. Nova Scotia (951945)
3. Manitoba (1288950)
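The ranking above can be reproduced with one query that filters on all three provinces and sorts the result, instead of one query per province. The sketch below loads the three values reported above into an in-memory `sqlite3` table (a stand-in for the MySQL table) and orders them.

```python
import sqlite3

TOTAL = "Total - Single and multiple responses of knowledge of languages"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE language_speakers (GEO TEXT, KOL TEXT, SMRKOL TEXT, VALUE INTEGER)"
)

# Values taken from the three query outputs above.
conn.executemany(
    "INSERT INTO language_speakers VALUES (?, ?, ?, ?)",
    [
        ("Manitoba", "English", TOTAL, 1288950),
        ("Prince Edward Island", "English", TOTAL, 149525),
        ("Nova Scotia", "English", TOTAL, 951945),
    ],
)

# One filtered, sorted query replaces three separate lookups.
ranking = conn.execute(
    """
    SELECT GEO, VALUE
      FROM language_speakers
     WHERE SMRKOL = ?
       AND KOL = 'English'
       AND GEO IN ('Manitoba', 'Prince Edward Island', 'Nova Scotia')
     ORDER BY VALUE ASC
    """,
    (TOTAL,),
).fetchall()

for position, (geo, value) in enumerate(ranking, start=1):
    print(f"{position}. {geo} ({value})")
conn.close()
```

Note that these rows use the `Total - Single and multiple responses` category. To restrict the comparison to people who reported English as their only language, the filter would presumably be `SMRKOL = 'Single responses of knowledge of languages'`; those values are not shown in this notebook, so they are not included here.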

**Question 3 (4 marks)**

What kind of data would be interesting to receive in combination with the data in this table? Discuss how you could use this additional information to extend one of your guiding questions.

Data on household disposable income, standard of living, exact housing community locations, family characteristics, and marital status would all be valuable to consider alongside the language_speakers.csv file. With these datasets we could better understand the factors that influence whether an individual can speak one or more languages (official and non-official alike). For example, with disposable income data we could examine whether families with more disposable income speak more languages, and whether access to educational services played a minor or major role in their capacity to learn those languages. Similarly, knowing whether a community is based in a more traditional, reserve-like setting (for Indigenous populations) would expand our knowledge of how the languages of those populations are being preserved, and of the trends in the growth of those languages' speakers across Canada's provinces and territories.

In relation to my guiding questions, I would like to see how the family characteristics and marital status of different Nehiyawewin (Plains Cree) speaking communities influence growth in the number of speakers. Extending this guiding question would enhance my understanding of how Indigenous languages are being taught and preserved (possibly through oral tradition). Likewise, I would extend my second guiding question by examining household community locations, as these may correlate with the influences that lead children in those communities to speak one or more languages. If a community is diverse, with multiple ethnicities and races, it would be reasonable to expect a stronger influence to learn more than one's own mother tongue and, in the process, to adopt multiple languages.

## Part D: Reflection (5 marks)

In 100 to 250 words, identify a concept you have found difficult or confusing from this assignment. Reflect on how your previous learning or experience helped you to understand this concept. Provide your reflection using markdown in the cell below

I found two concepts from this assignment particularly challenging: understanding the COORDINATE column and filtering with SQL queries. For the first, I used my critical-thinking skills to decipher what the COORDINATE points (x.x.x.x.x.x) meant; by identifying patterns in the language_speakers.csv file, I was able to determine what each value within a COORDINATE point represented (explained earlier in this assignment). This pattern identification gave me a better understanding of the dataset overall and made my later querying more effective and efficient.

For SQL filtering, I approached each query step by step: I would implement one component of what I wanted SQL to derive from the data, then extend my code with additional filters. Building my queries incrementally let me add the pieces of the SQL filter sequentially and greatly reduced the time needed to complete them. I began by visualizing what information I wanted SQL to extract, writing out in plain language on a piece of paper what I wanted the query to return. From there I worked out which information was necessary and which was not, which reduced the number of lines and the inefficiencies in my code. I cross-referenced my answers with the Excel filtering tool to ensure that I had received the values I was looking for.

For the most part, this assignment required an adjustment in thinking and methodology rather than new learning, as I had already used SQL in past university coursework; the challenge was adapting my previous experience with the language to these new tasks.
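
The incremental filtering approach described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used for this assignment: an in-memory SQLite table stands in for the MySQL `language_speakers` table, its rows reuse only values reported earlier in this notebook, and the names `steps`, `conditions`, and `counts` are hypothetical.

```python
import sqlite3

TOTAL = "Total - Single and multiple responses of knowledge of languages"
CREE = "Nehiyawewin (Plains Cree)"

# Assumption: an in-memory SQLite table stands in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE language_speakers (GEO TEXT, KOL TEXT, SMRKOL TEXT, VALUE INTEGER)"
)
conn.executemany(
    "INSERT INTO language_speakers VALUES (?, ?, ?, ?)",
    [
        ("Alberta", CREE, TOTAL, 4270),
        ("Ontario", CREE, TOTAL, 135),
        ("Manitoba", "English", TOTAL, 1288950),
        ("Nova Scotia", "English", TOTAL, 951945),
    ],
)

# Filters are added one at a time, in the order they were written out on paper.
steps = [
    ("SMRKOL = ?", TOTAL),
    ("KOL = ?", "English"),
    ("GEO = ?", "Manitoba"),
]

conditions, params, counts = [], [], []
for condition, value in steps:
    conditions.append(condition)
    params.append(value)
    # Re-run the count after each added filter to see how the result narrows.
    count_sql = "SELECT COUNT(*) FROM language_speakers WHERE " + " AND ".join(conditions)
    counts.append(conn.execute(count_sql, params).fetchone()[0])
    print(f"{len(conditions)} filter(s): {counts[-1]} matching row(s)")

# The final query uses every filter that was built up.
final_sql = "SELECT GEO, KOL, VALUE FROM language_speakers WHERE " + " AND ".join(conditions)
result = conn.execute(final_sql, params).fetchall()
conn.close()
print(result)
```

Checking the row count after each step plays the same role as applying one Excel filter at a time: an unexpected jump to zero rows points directly at the filter that was just added.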

In [32]:
# Use this cell to include some code to dispose of your SQLAlchemy engine object
engine.dispose()
myconnection.close()

## References

Statistics Canada (Government of Canada). 2022. Knowledge of languages by age and gender: Canada, provinces and territories, census metropolitan areas and census agglomerations with parts. Retrieved October 15, 2022 from https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810021701

University of Calgary (Faculty of Science - Computer Science). 2022. Getting started with SQLAlchemy and pandas. Retrieved November 22, 2022 from https://d2l.ucalgary.ca/d2l/le/content/472036/viewContent/5581075/View

University of Calgary (Faculty of Science - Computer Science). 2022. Getting started with Python SQL Connectors. Retrieved November 20, 2022 from https://d2l.ucalgary.ca/d2l/le/content/472036/viewContent/5581071/View