# Assignment Purpose

In this assignment, we will analyze a dataset using only the built-in parts of Python 3.  
Use only modules that ship with Python 3 (e.g. do not import pandas).  
"Built-ins" refers to the functions, modules, keywords, etc. that are packaged with Python 3.  
Refer to https://docs.python.org/3/library/ if you are unsure what is included.

The dataset we chose is taken from Kaggle's open data:

https://www.kaggle.com/osmi/mental-health-in-tech-survey

We will distribute the survey.csv file so you don't have to download the dataset.  
Take a look at the dataset description to understand the semantics of the different columns.

The first thing you should do is upload the CSV file to your Colaboratory instance.  
We have supplied code below to accomplish this:

In [1]:
# The following is code for uploading a file to the colab.research.google
# environment.

# library for uploading files
from google.colab import files

def upload_files():
    # initiates the upload - follow the dialogues that appear
    uploaded = files.upload()

    # verify the upload
    for fn in uploaded.keys():
        print('User uploaded file "{name}" with length {length} bytes'.format(
            name=fn, length=len(uploaded[fn])))

    # uploaded files need to be written to file to interact with them
    # as part of a file system
    for filename in uploaded.keys():
        with open(filename, 'wb') as f:
            f.write(uploaded[filename])

In [2]:
upload_files()

Saving survey.csv to survey.csv
User uploaded file "survey.csv" with length 303684 bytes


## Mandatory Tasks

### Task 1

Use the `csv` built-in module to read the survey.csv file:

https://docs.python.org/3/library/csv.html

You can do any preprocessing or set up any data structures you want to use to do the rest of the tasks.

In [3]:
import csv

# The csv documentation recommends opening files with newline="".
f = open("survey.csv", "r", newline="")
survey_reader = csv.reader(f, delimiter=",")
type(survey_reader)

_csv.reader

In [4]:
import csv

# The with block closes the file once every row has been read.
with open("survey.csv", "r", newline="") as f:
  survey_reader = csv.reader(f, delimiter=",")
  data_list = list(survey_reader)
data_list


[['Timestamp',
  'Age',
  'Gender',
  'Country',
  'state',
  'self_employed',
  'family_history',
  'treatment',
  'work_interfere',
  'no_employees',
  'remote_work',
  'tech_company',
  'benefits',
  'care_options',
  'wellness_program',
  'seek_help',
  'anonymity',
  'leave',
  'mental_health_consequence',
  'phys_health_consequence',
  'coworkers',
  'supervisor',
  'mental_health_interview',
  'phys_health_interview',
  'mental_vs_physical',
  'obs_consequence',
  'comments'],
 ['2014-08-27 11:29:31',
  '37',
  'Female',
  'United States',
  'IL',
  'NA',
  'No',
  'Yes',
  'Often',
  '6-25',
  'No',
  'Yes',
  'Yes',
  'Not sure',
  'No',
  'Yes',
  'Yes',
  'Somewhat easy',
  'No',
  'No',
  'Some of them',
  'Yes',
  'No',
  'Maybe',
  'Yes',
  'No',
  'NA'],
 ['2014-08-27 11:29:37',
  '44',
  'M',
  'United States',
  'IN',
  'NA',
  'No',
  'No',
  'Rarely',
  'More than 1000',
  'No',
  'No',
  "Don't know",
  'No',
  "Don't know",
  "Don't know",
  "Don't know",
  "Don't 

**Recommended:**

We recommend that you set up some data structures to make working with the dataset easier. There are different ways of solving this problem; the following is only our suggestion. Create a list of dictionaries in which each dictionary represents one line of the data file and maps each column name to that line's value. Set up this data structure below and use it in the other tasks:
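As one possible sketch of this structure, `csv.DictReader` from the same built-in module builds one dictionary per row, taking the keys from the header row. The text in `sample_csv` is a made-up three-column excerpt standing in for survey.csv, so the snippet runs without the file; `sample_lines` is likewise an illustrative name.

```python
import csv
import io

# Hypothetical three-column excerpt standing in for survey.csv.
sample_csv = "Age,Country,family_history\n37,United States,No\n44,United States,Yes\n"

# DictReader takes the keys from the first row and yields one mapping per data row.
with io.StringIO(sample_csv) as sample_file:
    sample_lines = [dict(row) for row in csv.DictReader(sample_file)]

print(sample_lines[0])
```

With the real file, the same comprehension would sit inside `with open("survey.csv", newline="") as f:`.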


In [5]:
import csv

# Read every row; the first row holds the column names.
with open("survey.csv", "r", newline="") as f:
  data_list = list(csv.reader(f, delimiter=","))

header = data_list[0]
# Build one dictionary per respondent, mapping column name to value.
lines = [dict(zip(header, cur_line)) for cur_line in data_list[1:]]
lines


[{'Timestamp': '2014-08-27 11:29:31',
  'Age': '37',
  'Gender': 'Female',
  'Country': 'United States',
  'state': 'IL',
  'self_employed': 'NA',
  'family_history': 'No',
  'treatment': 'Yes',
  'work_interfere': 'Often',
  'no_employees': '6-25',
  'remote_work': 'No',
  'tech_company': 'Yes',
  'benefits': 'Yes',
  'care_options': 'Not sure',
  'wellness_program': 'No',
  'seek_help': 'Yes',
  'anonymity': 'Yes',
  'leave': 'Somewhat easy',
  'mental_health_consequence': 'No',
  'phys_health_consequence': 'No',
  'coworkers': 'Some of them',
  'supervisor': 'Yes',
  'mental_health_interview': 'No',
  'phys_health_interview': 'Maybe',
  'mental_vs_physical': 'Yes',
  'obs_consequence': 'No',
  'comments': 'NA'},
 {'Timestamp': '2014-08-27 11:29:37',
  'Age': '44',
  'Gender': 'M',
  'Country': 'United States',
  'state': 'IN',
  'self_employed': 'NA',
  'family_history': 'No',
  'treatment': 'No',
  'work_interfere': 'Rarely',
  'no_employees': 'More than 1000',
  'remote_work': 'No

In [None]:
# Count the respondents who answered Yes in the family_history column.
count = 0
for line in lines:
  if line['family_history'] == "Yes":
    count += 1
print(count)



492


### Task 2

Report the percentage of respondents who work at companies with more than 1000 employees.

The result should be printed using a formatted string. Percentages should be shown to one decimal place.

In [12]:
count = 0
for line in lines:
  if line['no_employees'] == "More than 1000":
    count += 1
print("Companies Count:", count)
print("Total Companies Count:", len(lines))
print("Percentage:", f"{count / len(lines) * 100:.1f}")

Companies Count: 282
Total Companies Count: 1259
Percentage: 22.4
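
The `.1f` format specifier used above rounds the value to one decimal place. A minimal sketch that wraps it in a helper follows; `format_percentage` is a hypothetical name, and returning "n/a" for an empty total is an assumption rather than a requirement of the task.

```python
def format_percentage(part, total):
    """Return part/total as a percentage string with one decimal place."""
    if total == 0:
        # Assumption: report "n/a" instead of raising ZeroDivisionError.
        return "n/a"
    return f"{part / total * 100:.1f}%"


# Counts taken from the output above.
print(format_percentage(282, 1259))
```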


### Task 3

How many self-employed respondents are in California? You must use the `and` boolean operator.

In [16]:
count = 0
for line in lines:
  if line['self_employed'] == 'Yes' and line['state'] == 'CA':
    count += 1
print("Self-employed respondents in California:", count)

Self-employed respondents in California: 10


### Task 4

Consider only those respondents who work at tech companies.  
For each country, report the proportion of these respondents who know their company offers benefits.  
I.e., for each country, count the number of respondents who answered Yes when asked whether their company provides benefits.

When presenting the results, you must do the following:
*  Print each country's results on its own line
*  Print each country's results using a formatted string
*  Show the count as well as the percentage on each line (percentage computed for each country)
*  Sort the results before printing. Results should be sorted by count (highest to lowest) and should use alphabetical order of the country name as a tie-breaker.
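
The sorting rule in the last bullet can be expressed with a tuple key: negating the count orders it from highest to lowest, and the country name in second position breaks ties alphabetically. A small sketch on made-up numbers (the values in `sample_counts` are illustrative, not survey results):

```python
# Illustrative counts only; not taken from the survey.
sample_counts = {"Sweden": 2, "Canada": 20, "Ireland": 2, "Australia": 2}

# Sort by count descending, then by country name ascending.
ranked = sorted(sample_counts.items(), key=lambda item: (-item[1], item[0]))

for country, count in ranked:
    print(f"{country}: {count}")
```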

In [None]:
# Count tech-company respondents per country.
tech_company_per_country = {}
for current_line in lines:
  if current_line["tech_company"] == "Yes":
    country = current_line["Country"]
    tech_company_per_country[country] = tech_company_per_country.get(country, 0) + 1
print(tech_company_per_country)

In [20]:
country_results = {}

for line in lines:
  if line['tech_company'] == 'Yes':
    country = line['Country']
    if country not in country_results:
      country_results[country] = {'yes': 0, 'total': 0}
    country_results[country]['total'] += 1
    if line['benefits'] == 'Yes':
      country_results[country]['yes'] += 1

sorted_results = sorted(country_results.items(), key=lambda x: (-x[1]['yes'], x[0]))


for country, result in sorted_results:
  percentage = result['yes'] / result['total'] * 100
  print(f"{country}: {result['yes']} of {result['total']} ({percentage:.1f}%)")


United States: 320 of 611 (52.4%)
Canada: 20 of 56 (35.7%)
United Kingdom: 9 of 139 (6.5%)
Germany: 5 of 40 (12.5%)
Netherlands: 3 of 25 (12.0%)
Australia: 2 of 17 (11.8%)
Ireland: 2 of 26 (7.7%)
Sweden: 2 of 7 (28.6%)
Switzerland: 2 of 7 (28.6%)
Austria: 1 of 3 (33.3%)
Bahamas, The: 1 of 1 (100.0%)
Belgium: 1 of 3 (33.3%)
Croatia: 1 of 2 (50.0%)
France: 1 of 12 (8.3%)
New Zealand: 1 of 7 (14.3%)
Norway: 1 of 1 (100.0%)
Russia: 1 of 3 (33.3%)
Bosnia and Herzegovina: 0 of 1 (0.0%)
Brazil: 0 of 6 (0.0%)
Bulgaria: 0 of 4 (0.0%)
China: 0 of 1 (0.0%)
Colombia: 0 of 2 (0.0%)
Costa Rica: 0 of 1 (0.0%)
Czech Republic: 0 of 1 (0.0%)
Denmark: 0 of 2 (0.0%)
Finland: 0 of 2 (0.0%)
Greece: 0 of 2 (0.0%)
Hungary: 0 of 1 (0.0%)
India: 0 of 9 (0.0%)
Israel: 0 of 4 (0.0%)
Italy: 0 of 5 (0.0%)
Japan: 0 of 1 (0.0%)
Latvia: 0 of 1 (0.0%)
Mexico: 0 of 3 (0.0%)
Moldova: 0 of 1 (0.0%)
Nigeria: 0 of 1 (0.0%)
Philippines: 0 of 1 (0.0%)
Poland: 0 of 7 (0.0%)
Portugal: 0 of 2 (0.0%)
Singapore: 0 of 4 (0.0%)
Slov

## Optional Tasks

### Optional Task 1

What are the maximum and minimum ages of respondents? Remove values that don't make sense.

Bonus: Compute each of the max and min in one line. Make use of the `min`, `max`, `map`, and `filter` functions.  
You can parse the data into a list of lists without it counting toward your line count.
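
One way these functions can be combined, sketched on made-up values rather than the survey data. The bounds of 15 and 100 are an assumption about which ages "make sense", and the sketch also assumes every value parses as an integer; `sample_ages` and `is_plausible` are illustrative names.

```python
# Made-up age strings, including values that do not make sense.
sample_ages = ["37", "44", "-29", "329", "18", "99999999999", "5"]


def is_plausible(age):
    # Assumed plausible range for a working respondent.
    return 15 <= age <= 100


# map converts the strings to integers; filter drops the implausible ones.
max_age = max(filter(is_plausible, map(int, sample_ages)))
min_age = min(filter(is_plausible, map(int, sample_ages)))

print(max_age, min_age)
```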


### Optional Task 2

Analyze the frequency of words in the comments section of the survey.

You should use a `dict` to count the frequency of words across all comments.

Report the top ten most frequent words.

You should do the following pre-processing:
*  Remove punctuation from the words before counting them
*  Remove "useless" words (the, a, an, me, you etc.). You should use your own judgement as to what "useless" means. Think about the kinds of analytics someone might want to do with this data.

Write down possible uses of your findings as comments.
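
The pre-processing and counting steps above could be sketched as follows. The comments in `sample_comments` are made up, and the `stop_words` set is only an assumed starting point (it includes "na" on the assumption that NA marks an empty comment); both should be replaced with the survey data and your own word list.

```python
import string

# Made-up comments standing in for the survey's comments column.
sample_comments = [
    "My employer offers great support.",
    "Little support from my employer, sadly!",
    "NA",
]

# Assumed stop-word list; extend it using your own judgement.
stop_words = {"the", "a", "an", "me", "you", "my", "from", "na"}

# Translation table that deletes every ASCII punctuation character.
strip_punctuation = str.maketrans("", "", string.punctuation)

word_counts = {}
for comment in sample_comments:
    for word in comment.lower().translate(strip_punctuation).split():
        if word not in stop_words:
            word_counts[word] = word_counts.get(word, 0) + 1

# Ten most frequent words, with ties broken alphabetically.
top_ten = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))[:10]
print(top_ten)
```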