# Assignment
The assignment is divided into 3 parts:
1. Numpy
2. Pandas
3. Exploration

Each section has its own instructions to follow and questions that must be answered. Please note that if you use any additional libraries apart from **numpy**, **pandas** or **matplotlib**, you must include an **environment.yml** file so that I can duplicate your conda environment. 

Deadline for submitting this assignment is `Monday Feb 21st at 23:59`.

#### List of files
 - *Assignment.ipynb* - which is to be renamed with your name and course town as such: **Firstname.Lastname_TOWN** where TOWN is to be replaced with **MO** for Malmö or **HMS** for Halmstad.
 - *countries.csv*
 - *covid-countries-data.csv*
 - *whatsapp analysis example.pdf*

### Grading
In order to obtain a **G** you must: 
>- Complete the whole *Numpy* section. 
>- Complete *Part 2*, except the questions marked **Q - VG**. 
>- Complete *Part 3*, except *Step 4* and the **VG** question in *Step 5*.

To obtain **VG** you must:
>- Complete all of the steps required in the **G** section. 
>- Complete the **Q - VG** questions in *Part 2*. 
>- Complete *Step 4* in *Part 3*.

##### Resources: 
- [Numpy official tutorial](https://numpy.org/doc/stable/user/quickstart.html)
- [Matplotlib](https://github.com/rougier/matplotlib-tutorial)

## Part 1 - Numpy
The objective of this part of the assignment is to develop a solid understanding of Numpy array operations. In this assignment you will:
> 
> 1. Pick 5 interesting Numpy array functions by going through the documentation: https://numpy.org/doc/stable/reference/routines.html
> 2. Run and modify this Jupyter notebook to illustrate their usage (some explanation and 3 examples for each function). Use your imagination to come up with interesting and unique examples.
> 3. Do not use any of the functions mentioned on slide 11 of lecture notes *6. Datahantering och Numpy*. Choose something new!
> 4. Try to give this section an interesting title & subtitle e.g. "*5 Numpy functions you didn't know you needed*", "*Interesting ways to create Numpy arrays*" etc.

## Bitwise functions in numpy


Bitwise functions operate on the binary representation of numbers. This section covers three of them, plus two linear algebra functions:
- bitwise_and 
- bitwise_or
- bitwise_xor
- linalg.det
- matmul

In [None]:
import numpy as np

List of functions explained 
1. bitwise_and - The new value has a 1 in each bit position where both numbers have a 1 in their binary representation, otherwise a 0. Follows logical AND.
2. bitwise_or - The new value has a 1 in each bit position where at least one of the numbers has a 1; if both are 0 the new value is 0. Follows logical OR.
3. bitwise_xor - The new value has a 1 in each bit position where exactly one of the numbers has a 1; if both are 1 or both are 0 the new value is 0. Follows logical XOR.
4. linalg.det - a linear algebra function that calculates the determinant of a square matrix
5. matmul - a linear algebra function that calculates the matrix product of two arrays

## bitwise_and

Bitwise operations work on values that are either true or false. The input can be either an array filled with True and False values or integers, which are treated bit by bit in their binary representation.

Takes two array-like arguments as input, plus the optional arguments "out" and "where".

Out is a location where the result will be stored, and it must therefore have a shape that the inputs broadcast to.

Where is a condition that is broadcast over the input. Where the condition is True, the output array is set to the ufunc result; elsewhere it retains its original value.

The output of the function is an ndarray, or a scalar if both inputs are scalars.
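As a minimal sketch of how "out" and "where" interact (the array values and the names `left`, `right`, `mask` and `target` are made up for illustration):

```python
import numpy as np

left = np.array([11, 12, 13, 14])
right = np.array([13, 13, 13, 13])

# Output array with the same shape as the inputs, pre-filled with -1
target = np.full(4, -1)
# Only compute the result at positions where the mask is True
mask = np.array([True, False, True, False])

result = np.bitwise_and(left, right, out=target, where=mask)
print(result)  # the masked-out positions keep the pre-filled -1
```

The returned array is the same object as `target`, so the positions skipped by the mask simply keep whatever was stored there before the call.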


In [213]:
A = 11
B = 13
np.bitwise_and(A,B)

9

In the code above:

    A = 1011 in binary

    B = 1101 in binary

bitwise_and follows logical AND, which means that it looks at each separate bit, and if both are true (1), the output bit is also true.

reading left to right:

    first number = 1 because both values are 1

    second number = 0 because A is 0

    third number = 0 because B is 0

    fourth number = 1 because both are 1
    
That leaves us with 1001 in binary which corresponds to 9
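The walk-through above can be checked by printing the binary representations side by side. This is a small sketch using `np.binary_repr`; the variable names are only illustrative:

```python
import numpy as np

first, second = 11, 13
result = np.bitwise_and(first, second)

# Print each value as a 4-bit binary string next to its decimal form
for name, value in [("A", first), ("B", second), ("A & B", result)]:
    print(f"{name:>5} = {np.binary_repr(value, width=4)} ({value})")
```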


In [214]:
A = [True, False, False, True]
B = [False, True, False, True]
np.bitwise_and(A,B)

array([False, False, False,  True])

Following the same pattern as with the binary representation we get:

[False, False, False, True] because only the last element is True in both arrays 

In [229]:
A = 5
B = [True, False, True, False]
np.bitwise_and(A,B)

array([1, 0, 1, 0], dtype=int32)

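Notably, this also works when combining a number with a list of True and False values. Each boolean is treated as 1 or 0 and combined with A = 5 (0101 in binary) element by element: 5 AND 1 = 1 and 5 AND 0 = 0.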

In [231]:
A = 5.0
B = [True, False, True, False]
np.bitwise_and(A,B)

TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In [269]:
A = 5
B = [True, False, True, False]
C = np.ndarray(shape=(1,1))
np.bitwise_and(A, B, out=C)


ValueError: non-broadcastable output operand with shape (1,1) doesn't match the broadcast shape (1,4)

The bitwise functions only work for integer and boolean inputs; other types, such as floats, raise a TypeError. This is true for bitwise_or and bitwise_xor as well.
They also throw a ValueError when the shape of out doesn't match the broadcast shape.
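For comparison, here is a sketch of how the two failing calls could be made to work, assuming it is acceptable to convert the float to an integer first. The names `value`, `flags` and `target` are illustrative:

```python
import numpy as np

flags = [True, False, True, False]

# A float has to be converted to an integer before it can be used
value = int(5.0)

# The output array needs the broadcast shape (4,) and an integer dtype
target = np.empty(4, dtype=np.int64)
np.bitwise_and(value, flags, out=target)
print(target)
```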

## bitwise_or

bitwise_or works so that if either of the values is a 1, the new value has a 1 in that location.

The input can be either an array filled with True and False values or integers in their binary representation.

Takes two array-like arguments as input, plus the optional arguments "out" and "where".

Out is a location where the result will be stored, and it must therefore have a shape that the inputs broadcast to.

Where is a condition that is broadcast over the input. Where the condition is True, the output array is set to the ufunc result; elsewhere it retains its original value.

The output of the function is an ndarray, or a scalar if both inputs are scalars.

In [232]:
A = 11
B = 13
np.bitwise_or(A,B)

15

A = 1011

B = 1101

first number = 1 because both are 1

second number = 1 because B is 1

third number = 1 because A is 1

fourth number = 1 because both are 1

that gives us 1111 which equals 15

In [233]:
A = [True, True, False, False]
B = [True, False, False, True]
np.bitwise_or(A,B)

array([ True,  True, False,  True])

first element = True since both are true

second element = True since A is true

third element = False since both are false

fourth element = True since B is true

that gives us [True, True, False, True]
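A common practical use of bitwise_or on boolean arrays is combining two filter conditions. The sketch below uses made-up values and shows that the `|` operator gives the same result as `np.bitwise_or`:

```python
import numpy as np

values = np.array([1, 5, 10, 15])

# Two boolean masks, combined element by element
too_small = values < 3
too_large = values > 12
outside = np.bitwise_or(too_small, too_large)

# The | operator is shorthand for np.bitwise_or on arrays
same_as_operator = np.array_equal(outside, too_small | too_large)

print(outside, same_as_operator)
print(values[outside])
```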


## bitwise_xor

bitwise_xor gives a value of 1 if either of the values is 1, but not both.

The input can be either an array filled with True and False values or integers in their binary representation.

Takes two array-like arguments as input, plus the optional arguments "out" and "where".

Out is a location where the result will be stored, and it must therefore have a shape that the inputs broadcast to.

Where is a condition that is broadcast over the input. Where the condition is True, the output array is set to the ufunc result; elsewhere it retains its original value.

The output of the function is an ndarray, or a scalar if both inputs are scalars.


In [261]:
A = 11
B = 13
np.bitwise_xor(A, B)


6

A = 1011

B = 1101

first number = 0 because both are 1

second number = 1 because A is 1 and B is 0

third number = 1 because B is 1 and A is 0

fourth number = 0 because both are 1

that gives us 0110 which equals 6
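One property worth checking in code is that XOR undoes itself: applying it a second time with the same value restores the original number. A short sketch with illustrative variable names:

```python
import numpy as np

first, second = 11, 13
mixed = np.bitwise_xor(first, second)
print(np.binary_repr(mixed, width=4), mixed)

# Applying XOR with the same value a second time restores the original
restored = np.bitwise_xor(mixed, second)
print(restored)
```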

In [240]:
A = [True, True, False, False]
B = [True, False, False, True]
np.bitwise_xor(A, B)


array([False,  True, False,  True])

first element = False since both are true

second element = True since A is true and B is false

third element = False since both are false

fourth element = True since B is true and A is false


## linalg.det

linalg.det is a function that calculates the determinant of an array. The determinant is the factor by which space is scaled in a linear transformation.

The function takes a square matrix (or a stack of square matrices) as input and gives a float or an array as output.

In [260]:
a = np.array([[1, 2], [3, 4]])
np.linalg.det(a)


-2.0000000000000004

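The determinant of a 2x2 matrix [[a, b], [c, d]] is calculated as

a * d - b * c 

=

1 * 4 - 2 * 3 = -2

NumPy computes this in floating point, so the printed result can differ from -2 by a tiny rounding error.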
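The hand calculation can be compared with NumPy's result as follows. This is a sketch; since `np.linalg.det` works in floating point, the comparison uses `np.isclose` rather than `==`:

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

# Hand calculation for a 2x2 matrix: a*d - b*c
(a11, a12), (a21, a22) = matrix
by_hand = a11 * a22 - a12 * a21

by_numpy = np.linalg.det(matrix)

print(by_hand, by_numpy)
# Compare with a tolerance instead of == because of floating point rounding
print(np.isclose(by_hand, by_numpy))
```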

In [242]:
a = np.array([[[1, 2], [3, 4]], [[1, 2], [2, 1]], [[1, 3], [3, 1]]])
np.linalg.det(a)


array([-2., -3., -8.])

input is a stack of matrices and output is an array with values for each matrix

first element is same as example above

second element is  1 * 1 - 2 * 2 = -3

third element is 1 * 1 - 3 * 3 = -8

which gives us the array [-2., -3., -8.]


In [251]:
a = np.array([[1, 2,3], [1,3, 4]])
np.linalg.det(a)

LinAlgError: Last 2 dimensions of the array must be square

The determinant can only be calculated for a square matrix. If the function is called with a non-square matrix as parameter, a LinAlgError is thrown.

## matmul

matmul is a function that performs matrix multiplication. It takes two arrays as input and the output is an array.

In [253]:
a = np.array([[1, 0],
              [0, 1]])
b = np.array([[4, 1],
              [2, 2]])
np.matmul(a, b)


array([[4, 1],
       [2, 2]])

In [254]:
a = np.array([[1, 0],
              [0, 1]])
b = np.array([1, 2])
np.matmul(a, b)


array([1, 2])

This gives you an array as output, the product of the matrix and the vector.

In [257]:
a = np.array([[1, 0],
              [0, 1]])
b = np.array([1, 2,3])
np.matmul(a, b)


ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 2)

If the dimensions needed for matrix multiplication don't match, a ValueError is thrown.
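The dimension rule can be checked before calling matmul. The helper `can_matmul` below is hypothetical (it is not part of NumPy) and only compares the inner dimensions; it does not check broadcasting of stacked matrices:

```python
import numpy as np

identity = np.array([[1, 0],
                     [0, 1]])
vector_ok = np.array([1, 2])
vector_bad = np.array([1, 2, 3])

def can_matmul(left, right):
    """Hypothetical helper: True if the inner dimensions match."""
    inner_right = right.shape[0] if right.ndim == 1 else right.shape[-2]
    return left.shape[-1] == inner_right

print(can_matmul(identity, vector_ok))
print(can_matmul(identity, vector_bad))

# The @ operator is equivalent to np.matmul for arrays
product = identity @ vector_ok
print(product)
```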

### Conclusion

Summarize what was covered in *Part 1*, and where to go next.


Bitwise operations are element-wise operations on arrays of integers or booleans, working on their binary representation. 

bitwise_and follows logical AND, which means that the new value is true only when both values are true.

bitwise_or returns true in the new position when one or both of the values are true, otherwise false.

bitwise_xor returns true in the new position when exactly one of the values is true. 

The bitwise functions can be used to help solve problems more efficiently, like the 8 queens puzzle:

https://en.wikipedia.org/wiki/Eight_queens_puzzle

The bitwise functions break when the input is not an integer or boolean type, or when the out argument does not have the same shape as the broadcast output.

For further reading, look at the other bitwise operations such as invert, left_shift and right_shift at https://numpy.org/doc/stable/reference/routines.bitwise.html

For a deeper understanding, look at the MIT lecture from their "Performance Engineering of Software Systems" class from 2018: https://www.youtube.com/watch?v=ZusiKXcz_ac



det calculates the determinant of the given array. The array must be a square matrix (or a stack of square matrices), otherwise an error is thrown. The output is a scalar or an ndarray.

matmul calculates the matrix product of two arrays, which can also be stacks of matrices. An error is thrown when the inputs have a dimension mismatch. The output is an ndarray.

# Part 2 - Pandas

As you go through *Part 2*, you will find a **???** in certain places. To complete this part of the assignment, you must replace all the **???** with appropriate values, expressions or statements to ensure that the notebook runs properly end-to-end. 

Some things to keep in mind:

* Make sure to run all the code cells, otherwise you may get errors like `NameError` for undefined variables.
* Do not change variable names, delete cells or disturb other existing code. It may cause problems during evaluation.
* In some cases, you may need to add some code cells or new statements before or after the line of code containing the **???**. 
* Questions marked **Q - VG** are for **VG level**.


In [None]:
import pandas as pd

Load the data from the supplied CSV file into a Pandas data frame.

In [None]:
countries_df = pd.read_csv('countries.csv')

In [None]:
countries_df

**Q1: How many countries does the dataframe contain?**
(Show which function/s you use to find this out.)

In [None]:
num_countries = len(countries_df) 

In [None]:
print('There are {} countries in the dataset'.format(num_countries))

**Q2: Retrieve a list of continents from the dataframe.**

In [None]:
continents = np.unique(countries_df['continent'])


In [None]:
continents

**Q3: What is the total population of all the countries listed in this dataset?**

In [None]:
total_population = countries_df['population'].sum()


In [None]:
print('The total population is {}.'.format(int(total_population)))

**Q4: Create a dataframe containing 10 countries with the highest population.**

In [None]:
most_populous_df = countries_df.nlargest(10, 'population')


In [None]:
most_populous_df

**Q5: Add a new column in `countries_df` to record the overall GDP per country (product of population & per capita GDP).**



In [None]:

countries_df['gdp'] = countries_df['gdp_per_capita'] * countries_df['population']


In [None]:
countries_df

**Q - VG: Create a dataframe containing 10 countries with the lowest GDP per capita, among the countries with a population greater than 100 million.**

In [None]:
c_df = countries_df[countries_df.population > 100e6].nsmallest(10, 'gdp_per_capita')

In [None]:
c_df

**Q6: Create a DataFrame that counts the number of countries on each continent.**

*Hint: `groupby`.*

In [None]:
country_counts_df = countries_df.groupby('continent').size()

In [None]:
country_counts_df

**Q7: Create a data frame showing the total population of each continent.**

In [None]:
continent_populations_df = countries_df.groupby('continent')['population'].sum()

In [None]:
continent_populations_df

Next, use the CSV file containing overall Covid-19 stats for various countries, and read the data into another Pandas data frame.

In [None]:
covid_data_df = pd.read_csv('covid-countries-data.csv')

In [None]:
covid_data_df

**Q8: Count the number of countries for which the `total_tests` data is missing.**


In [None]:
total_tests_missing = covid_data_df['total_tests'].isna().sum()

In [None]:
print("The data for total tests is missing for {} countries.".format(int(total_tests_missing)))

Let's merge the two data frames, and compute some more metrics.

**Q9: Merge `countries_df` with `covid_data_df` on the `location` column.**


In [None]:
combined_df = countries_df.merge(covid_data_df, how='inner', on='location')

In [None]:
combined_df
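To see what an inner merge keeps and drops, here is a small sketch on made-up toy data (not taken from the assignment files). The `indicator=True` option of an outer merge is one way to list the rows that an inner merge would leave out:

```python
import pandas as pd

# Made-up toy data, not taken from the assignment files
left_df = pd.DataFrame({'location': ['A', 'B', 'C'], 'population': [10, 20, 30]})
right_df = pd.DataFrame({'location': ['B', 'C', 'D'], 'total_cases': [2, 3, 4]})

# An inner merge keeps only the locations present in both data frames
inner_df = left_df.merge(right_df, how='inner', on='location')
print(inner_df)

# indicator=True in an outer merge shows which rows an inner merge would drop
outer_df = left_df.merge(right_df, how='outer', on='location', indicator=True)
unmatched = outer_df[outer_df['_merge'] != 'both']
print(unmatched)
```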

**Q10: Add columns `tests_per_million`, `cases_per_million` and `deaths_per_million` into `combined_df`.**

In [None]:
combined_df['tests_per_million'] = combined_df['total_tests'] * 1e6 / combined_df['population']

In [None]:
combined_df['cases_per_million'] = combined_df['total_cases'] * 1e6 / combined_df['population']

In [None]:
combined_df['deaths_per_million'] = combined_df['total_deaths'] * 1e6 / combined_df['population']

In [None]:
combined_df

**Q11: Create a dataframe with 10 countries that have highest number of tests per million people.**

In [None]:
highest_tests_df = combined_df.nlargest(10, 'tests_per_million')

In [None]:
highest_tests_df

**Q12: Create a dataframe with 10 countries that have highest number of positive cases per million people.**

In [None]:
highest_cases_df = combined_df.nlargest(10, 'cases_per_million')

In [None]:
highest_cases_df

**Q13: Create a dataframe with 10 countries that have highest number of deaths per million people.**

In [None]:
highest_deaths_df = combined_df.nlargest(10, 'deaths_per_million')

In [None]:
highest_deaths_df

**Q - VG: Count number of countries that feature in both the lists of "highest number of tests per million" and "highest number of cases per million".**

In [None]:
combined_df.nlargest(10, 'tests_per_million')['location'].isin(combined_df.nlargest(10, 'cases_per_million')['location']).sum()

**Q - VG: Count number of countries that feature in both the lists "20 countries with lowest GDP per capita" and "20 countries with the lowest number of hospital beds per thousand population". Only consider countries with a population higher than 10 million while creating the list.**

In [None]:
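populous_df = combined_df[combined_df.population > 10e6] #Only countries with more than 10 million people
lowest_gdp_df = populous_df.nsmallest(20, 'gdp_per_capita')
fewest_beds_df = populous_df.nsmallest(20, 'hospital_beds_per_thousand')
lowest_gdp_df['location'].isin(fewest_beds_df['location']).sum() #Counts locations present in both lists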
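An alternative way to count the overlap between two "top N" lists is a set intersection. The sketch below uses made-up toy data with the same column names and N = 3 instead of 20:

```python
import pandas as pd

# Made-up toy data using the same column names as the assignment
toy_df = pd.DataFrame({
    'location': ['A', 'B', 'C', 'D', 'E'],
    'gdp_per_capita': [1, 2, 3, 4, 5],
    'hospital_beds_per_thousand': [5, 1, 4, 2, 3],
})

lowest_gdp = set(toy_df.nsmallest(3, 'gdp_per_capita')['location'])
fewest_beds = set(toy_df.nsmallest(3, 'hospital_beds_per_thousand')['location'])

# Locations that appear in both lists
overlap = lowest_gdp & fewest_beds
print(len(overlap), sorted(overlap))
```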


# Part 3 - Exploration
The objective of *Part 3* is for you to reflect upon what kind of data is interesting to you and, using an example dataset, examine and explain what the data is like and what kind of things you could find out using it. 

Pick a real-world dataset of your choice and perform an exploratory data analysis. Focus on documentation and presentation - this Jupyter notebook will also serve as a project report, so make sure to include detailed explanations wherever possible using Markdown cells.

### Evaluation Criteria

Your submission will be evaluated using the following criteria:

>* Dataset must contain at least 5 columns and 500 rows of data
>* You must ask and answer at least 4 questions about the dataset
>* Your submission must include at least 4 visualizations (graphs) with axes, titles, labels and any other annotations necessary to understand the graph. 
>* Your submission must include explanations using markdown cells, apart from the code.
>* Your work must not be plagiarized i.e. copy-pasted from somewhere else.

#### Dataset repositories: 
- [UCI repository](http://archive.ics.uci.edu/ml/index.php)
- [Public datasets](https://github.com/awesomedata/awesome-public-datasets)
- [Google dataset search](https://datasetsearch.research.google.com/)
- [Kaggle datasets](https://www.kaggle.com/datasets?fileType=csv)

#### Example datasets:
- https://www.kaggle.com/datasnaek/youtube-new
- https://www.kaggle.com/imdevskp/corona-virus-report
- https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data

#### Example Projects

Refer to these projects for inspiration:

* [Analyzing your browser history using Pandas & Seaborn](https://medium.com/free-code-camp/understanding-my-browsing-pattern-using-pandas-and-seaborn-162b97e33e51) by Kartik Godawat

* [2019 State of JavaScript Survey Results](https://2019.stateofjs.com/demographics/)

* [2020 Stack Overflow Developer Survey Results](https://insights.stackoverflow.com/survey/2020)



## Follow this step-by-step guide to work on your project.

### Step 1: Select a real-world dataset 

>- Find an interesting dataset at any of the recommended repositories above.
>- The data should be in CSV format, and should contain at least 5 columns and 500 rows
>- Download the dataset using the pandas read_csv function and a URL. See example below. (Please note that when downloading from Kaggle, you will have to amend this code.) Alternatively, supply the exact link from which you got the dataset and any necessary instructions to download, unpack, load into a dataframe etc. in order to get it to work.

`import pandas as pd`

`url = 'https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'`

`c = pd.read_csv(url)`


### Step 2: Perform data preparation & cleaning

>- Load the dataset into a data frame using Pandas.
>- Explore the number of rows & columns, ranges of values etc.
>- Handle missing, incorrect and invalid data.
>- Perform any additional steps (parsing dates, creating additional columns, merging multiple datasets etc.).
>- Give a summary of the dataset as it is now, e.g. size, type of categories (qualitative vs. quantitative), quality, distribution etc. 


### Step 3: Perform exploratory analysis & visualization

>- Compute the mean, sum, range and other interesting statistics for numeric columns.
>- Explore distributions of numeric columns using histograms etc.
>- Explore relationship between columns using scatter plots, bar charts etc.
>- Make a note of interesting insights from the exploratory analysis.

### Step 4: Ask & answer questions about the data - VG

>- Ask at least 4 interesting questions about your dataset. What kind of analysis could you do on this data?
>- Answer the questions either by computing the results using Numpy/Pandas or by plotting graphs using Matplotlib.
>- Create new columns, merge multiple datasets and perform grouping/aggregation wherever necessary.
>- Wherever you're using a library function from Pandas/Numpy/Matplotlib etc. explain briefly what it does.


### Step 5: Summarize your inferences & write a conclusion

>- Write a summary of what you've learned from the analysis.
>- Include interesting insights and graphs from previous sections.
>- **(VG)** Share ideas for future work on the same topic using other relevant datasets.
>- Share links to resources you found useful during your analysis.



INTRODUCTION


The dataset I've chosen to work on is the Kaggle survey for data science and machine learning in 2021.
Kaggle is a data science and machine learning community where people can share datasets and findings. They also host a bunch of competitions with real life problems often sponsored by real companies.

For me the survey is interesting because we can ask some questions about how people approach the topic today. Since this is a topic I'm currently studying, hopefully I can learn some valuable lessons about how my approach differs from that of others.

I'll be analysing the different questions asked and present the data itself, but also try to correlate the data and draw some conclusions about deeper questions.



PREPARATION & CLEANING

In [None]:
#Importing the used libraries in this analysis
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


In [None]:
#Read in the raw csv file 
df_raw = pd.read_csv('2021_kaggle_ds_and_ml_survey_responses_only.csv')
df_raw.columns

In [None]:
#Use the first row (the actual question texts) as column names instead of the current columns
new_header = df_raw.iloc[0]
df_col_mod = df_raw[1:]
df_col_mod.columns = new_header
df_col_mod.columns

In [None]:
#Copying over the column modified dataframe to do some more cleaning on it
df_modified = df_col_mod.copy()


The dataset contains multiple choice questions. To make it easier to analyse all of the answers together I'll shorten the question names, for example:

'What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python' 

becomes

'What programming languages do you use on a regular basis? (Select all that apply)' 

This works because the 'Answer' column will still contain the selected choice

We do this before we melt the dataframe to save some iteration
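The effect of the renaming can be sketched on a plain list of column names. The second name below is an assumed example in the same style, and the snippet also shows the caveat that a question containing a hyphen of its own would be cut at that hyphen:

```python
# Example column names; the second one is assumed for illustration
columns = [
    'What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python',
    'What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R',
    'What is your age (# years)?',
]

# Keep only the text in front of the first '-' and remove surrounding spaces
shortened = [name.split('-')[0].strip() for name in columns]
print(shortened)

# Caveat: a hyphen inside the question text itself also triggers the split
print('Self-taught or not?'.split('-')[0].strip())
```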

In [None]:
#Keep only the question text in front of the first '-' in every column name
df_modified = df_modified.rename(columns=lambda name: name.split('-')[0].strip())

Looking at the questions asked in the survey, a lot of them require additional information in order to draw good conclusions from them. Take a question like:
"Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis? (Select all that apply)"
It's an interesting question to ask, but only in relation to the field as a whole and not only to those represented by the participants of the survey.
Looking at the usage of big data products can give you a picture of the current state of the field, but when restricted to those who participated in the survey it gets a bit more complicated.
There were 15 different job titles available as an answer (Other included), which means that without data showing whether or not this reflects the field we cannot draw good conclusions from it.
Add to this the fact that the platform has a focus on learning, which only attracts a certain group of people, and that the country representation isn't really diverse, and it becomes a lot harder to draw conclusions from the survey that reflect the field as a whole.
Because of this I chose to leave out a majority of the questions in the survey, in order to use the remaining data to draw some good conclusions.
I saved the dropped questions in "question.txt" if you want to take a look at them.


In [None]:
#Drop uninteresting questions from dataframe
with open('question.txt', 'r') as questions_file: #Closes the file automatically when done
    for line in questions_file:
        df_modified = df_modified.drop(columns=[line.strip()]) 


Since we'll be asking some questions that relate to other ones, I'll add the index as a column before I melt the dataframe, for easier access to the values.

In [None]:
df_modified['id'] = df_modified.index
df_modified = df_modified.melt(id_vars=["id"]) #Melts the dataframe (unpivots it from wide to long format)
df_modified.dropna(subset=["value"], inplace=True) #Drops rows with a NaN answer
df_modified.rename(columns={0: 'Question', 'value': 'Answer'}, inplace=True) #Rename columns to Question and Answer
df_modified 
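What melt does can be seen on a made-up miniature survey. In this sketch the column names are set directly through `var_name` and `value_name`, which is an alternative to renaming afterwards:

```python
import pandas as pd

# Made-up miniature survey: one row per participant, one column per question
wide_df = pd.DataFrame({
    'Age': ['18-21', '25-29'],
    'Language': ['Python', None],
})
wide_df['id'] = wide_df.index

# melt turns each (participant, question) pair into its own row
long_df = wide_df.melt(id_vars=['id'], var_name='Question', value_name='Answer')
# Unanswered questions show up as missing values and are dropped
long_df = long_df.dropna(subset=['Answer'])
print(long_df)
```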

Since the dataset is a survey from a data science and machine learning community, the data cannot be used to reflect upon the field as a whole and ought only to be viewed as a small subsection of it, especially without knowing exactly who uses the platform. But considering Kaggle's popularity among people starting out in or learning the field, a good framework for viewing the data is to look at it through the lens of someone studying the field rather than someone already well established in it. 
   

FUNCTIONS

To make it easier and have less repetitive code I'll make a couple of functions, primarily for plotting. 

In [None]:
def plot_unrelated(question,knd, yl, xl, figsz, fontsz, patchsz):
    fig, ax = plt.subplots(figsize=figsz, facecolor="white") #Creates the figure with a white background
    df_sqr = df_modified[df_modified['Question'] == question]['Answer'].value_counts(normalize=True).multiply(100) #Share of each answer, normalized and multiplied by 100 to get %
    df_sqr.plot(kind=knd, ax=ax, fontsize=fontsz, xlabel=xl, title=yl) #Plots the answer shares on the prepared axes
    if patchsz>0:
        for i in ax.patches:
            ax.text(i.get_width(), i.get_y(), str(round(i.get_width(), 2))+'%', fontsize=patchsz) #Writes each bar's length (% of answers) at its end, assumes a horizontal bar plot

In [None]:
def plot_related(q1, q2, knd, yl, xl, figsz, fontsz, patchsz):
    df_fir = df_modified[df_modified['Question'] == q1].copy()
    df_sec = df_modified[df_modified['Question'] == q2].copy()
    df_comb = df_fir.merge(df_sec, on="id") #Merges the two dataframes
    df_comb.drop(columns=['Question_x', 'Question_y'], inplace=True)
    dct = {}
    for i in df_comb['Answer_y'].unique(): #Find every unique value in the column
        dct[i] = df_comb[df_comb['Answer_y']== i]['Answer_x'].value_counts()
    df_final = pd.DataFrame(dct)
    df_final = df_final.divide(df_final.sum(axis=1), axis=0).multiply(100).sort_index() #Divide each value by its row total and multiply by 100 to get %, then sort the dataframe by index
    ax = df_final.plot(kind=knd, figsize=figsz, fontsize=fontsz)
    ax.set_xlabel(xl, fontsize=fontsz) #Sets the label of the x axis
    ax.set_title(yl, fontsize=fontsz) #Sets the title of the plot
    if patchsz>0:
        for i in ax.patches: 
            if i.get_width()>0: 
                ax.text(i.get_width(), i.get_y(), str(round(i.get_width(), 2))+'%', fontsize=patchsz)
    ax.legend(prop={'size': 8}) #Sets the legend size of the plot
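The normalization step used in plot_related can be sketched on its own with made-up counts. Each row is divided by its own total, so every row sums to 100%:

```python
import pandas as pd

# Made-up counts: rows are titles, columns are education levels
counts_df = pd.DataFrame(
    {"Bachelor": [30, 10], "Master": [60, 30], "Doctoral": [10, 60]},
    index=["Data Scientist", "Research Scientist"],
)

# Divide every row by its own total, then scale to percent
percent_df = counts_df.divide(counts_df.sum(axis=1), axis=0).multiply(100)
print(percent_df)
```

Dividing by the row total answers "within this title, how are the education levels distributed?". Dividing by the column total instead would answer the reverse question.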


QUESTIONS

A quick summary of the questions I'll be asking:
    Basic Question regarding the dataset
        How many questions were asked?
        How many people took the survey

    How does the distribution of survey participants look in terms of:
        Country

        Age

        Gender
        
        Title
        
        Experience
        
        Usage of programming languages
        
        Recommended programming language to start with
        
        Usage of IDE
    

Below is the list of questions from the survey that we'll use in the analysis, and the number of people who answered the survey.

In [None]:
print("Questions:")
for i in df_modified['Question'].unique():
    print(f"\t{i}")

print(f"\n{len(df_modified['id'].unique())} people answered the survey")

Country representation

In [None]:
#Calls plot single question function to plot answers out
plot_unrelated('In which country do you currently reside?', 'barh', 'In which country do you currently reside?', "Name of country", (10,10), 10, 10)

In contrast to last year's survey, which I know had 42 different countries, this year's only allowed you to choose between 11.
But same as last year, India had the most representation with 29%, which corresponds to 7532 participants. 
This might make it harder to draw good conclusions from the survey, at least in terms of countries, since we don't know whether or not this represents the field as a whole.  

Age distribution

In [None]:
#Calls plot single question function to plot answers out
plot_unrelated('What is your age (# years)?', "barh", 'What is your age (# years)?', "Age group", (15,10), 10, 8)

Not surprisingly, the younger generations are more represented than the older ones; after the age group of 25-29 a decline starts to happen.
 

Gender representation

In [None]:
#Calls plot single question function to plot answers out
plot_unrelated('What is your gender?', "barh", 'What is your gender?', "Gender", (15,10), 10, 10)


About 80% of the participants were male, 19% female and slightly below 2% preferred to self-describe.

Education distribution

In [None]:
q = 'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'
plot_unrelated(q, "barh", q, "Name of education", (15,10), 10, 8)

Almost 90% of the survey participants have attained or plan to attain a bachelor's degree or higher. This makes sense considering the nature of the topic. 

Because of how the question is written (with "or plan to attain within the next 2 years"), the data from this question is quite skewed. That is because participants who have already attained a degree and current students are lumped into the same category. 
When we later on try to use this data to draw other conclusions I'll take that into account and use other data points from the survey in order to substantiate the findings.

Title distribution

In [None]:
q = "Select the title most similar to your current role (or most recent title if retired):"
plot_unrelated(q, "barh", q, "Name of title", (15,10), 10, 10)

As seen in the plot above, most of the participants are students, which corresponds to 26.2% of the answers given.
We'll use this data point later on in relation to education level to make sure our conclusions are sound.

Title and Education distribution

In [None]:
q1 = "Select the title most similar to your current role (or most recent title if retired):"
q2 = "What is the highest level of formal education that you have attained or plan to attain within the next 2 years?"
plot_related(q1, q2, "barh", "Title and Education distribution", "Education level", (15,40), 20, 14)

Besides Developer Relations/Advocacy, which has equal parts bachelor's and master's degrees, a master's is the most common degree for every job title, with the exception of Research Scientist, where the majority had a doctoral degree. 
Most students are studying for a bachelor's or a master's degree.  

Experience distribution

In [None]:
q = 'For how many years have you been writing code and/or programming?'
plot_unrelated(q, "barh", q, "years of experience", (15,10), 10, 8)

A majority of the Kaggle users are newer to the field of data science and machine learning, with roughly 53% having less than 3 years of experience with writing code. 
 

Experience and Title distribution

In [None]:
q1 = "Select the title most similar to your current role (or most recent title if retired):"
q2 = 'For how many years have you been writing code and/or programming?'
plot_related(q1, q2, "barh", "Experience and Title distribution", "Title", (15,25), 15, 10)

It's not a surprise to me that for the most part between 0-3 years of experience are amongst the top answers across the titles (with some edgecases) due to the nature of Kaggle being a community with focus on learning so it most likely draws the attention more so to people newer to the field. This means that it probably doesn't reflect the field as a whole so when we draw conclusions from datapoints we have to keep this in mind. 

One notable difference in the data is that roughly 13% of business analysts and 12% of product managers answered that they have never written code. If I had to pick the two titles least likely to write code, it would probably have been those two, so this wasn't much of a surprise to me.

Looking at machine learning engineers, it's interesting to me that 3% have never written code and 15% have less than 1 year of experience. I don't know whether this is because the title is relatively new and the number of answers in the survey might not represent the field as a whole.


Let's assign a value to each label and calculate the average years per title

In [None]:
# Representative number of years for each experience label (range midpoints; 20 for "20+ years")
exp_dct = {"5-10 years": 7.5, "20+ years": 20, "1-3 years": 2, "< 1 years": 0.5, "10-20 years": 15, "I have never written code": 0, "3-5 years": 4}
q1 = "Select the title most similar to your current role (or most recent title if retired):"
q2 = 'For how many years have you been writing code and/or programming?'
df_q1 = df_modified.loc[df_modified['Question'] == q1].copy()
# Number of answers per title (column names set explicitly so they don't depend on the pandas version)
df_q1_cnt = df_q1['Answer'].value_counts().rename_axis("Title").reset_index(name="Number of answers")
df_q2 = df_modified.loc[df_modified['Question'] == q2].copy()
# One row per respondent: Answer_x is the title, Answer_y the experience label
df_comb = df_q1.merge(df_q2, on="id")
df_comb = df_comb.drop(columns=['Question_x', 'Question_y', 'id'])

# Convert the experience labels to numbers without chained assignment;
# labels missing from exp_dct become NaN and are ignored by mean()
df_comb['Answer_y'] = df_comb['Answer_y'].map(exp_dct)
# Average years of coding experience per title, highest first
df_comb = (
    df_comb.groupby('Answer_x')['Answer_y'].mean()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={"Answer_x": "Title", "Answer_y": "Avg yrs of coding exp"})
)
df_comb = df_comb.merge(df_q1_cnt, on='Title')
df_comb

As seen in the dataframe above, research scientists lead with an average of 8.5 years of experience.
Note that the number of answers for database engineer and developer relations/advocacy is below 200, so we'd need more data to draw conclusions from those data points.

Students and people who are currently not employed are at the bottom of the list, with averages of 2.3 and 3.4 years respectively, which makes sense given the titles they hold.
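
The caveat about titles with fewer than 200 answers can be made visible in the table itself by adding a flag column. This is a minimal sketch on a toy table with the same column names as `df_comb`. The titles and values are placeholders, and `MIN_ANSWERS` and `Small sample` are hypothetical names introduced here.

```python
import pandas as pd

# Toy table with the same columns as df_comb (placeholder titles, made-up values)
summary = pd.DataFrame({
    "Title": ["Title A", "Title B", "Title C"],
    "Avg yrs of coding exp": [8.0, 6.5, 2.0],
    "Number of answers": [1500, 170, 6800],
})

MIN_ANSWERS = 200  # threshold taken from the text above
# True for titles whose average is based on too few answers
summary["Small sample"] = summary["Number of answers"] < MIN_ANSWERS
print(summary)
```

Keeping the flag next to the average makes it harder to over-interpret a mean that is based on only a few answers.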

Usage of programming languages

In [None]:
q = "What programming languages do you use on a regular basis? (Select all that apply)"
plot_unrelated(q, "barh", q, "Language", (15,10), 10, 8)

To almost no surprise, Python is the most commonly used programming language among survey participants at 33%, followed by SQL at 16%. This makes sense because, at least from my point of view, those are the two easiest ones to get into.
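
For a "select all that apply" question, a percentage can mean two different things: the share of all selections, or the share of respondents who picked that answer. The sketch below shows both on made-up data, assuming one row per respondent and selected language. Which of the two the plot above shows depends on how the percentages were computed.

```python
import pandas as pd

# Toy multi-select answers: one row per (respondent, selected language)
toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3],
    "Answer": ["Python", "SQL", "Python", "SQL", "R", "Python"],
})

# Share of all selections (sums to 100% across languages)
share_of_selections = toy["Answer"].value_counts(normalize=True) * 100

# Share of respondents who selected each language (can sum to more than 100%)
n_respondents = toy["id"].nunique()
share_of_respondents = toy.groupby("Answer")["id"].nunique() / n_respondents * 100

print(share_of_selections.round(1))
print(share_of_respondents.round(1))
```

In this toy example every respondent uses Python, yet Python accounts for only half of all selections, so the two numbers should not be compared with each other.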

Language and Title distribution

In [None]:
q1 = "Select the title most similar to your current role (or most recent title if retired):"
q2 = "What programming languages do you use on a regular basis? (Select all that apply)"
plot_related(q1, q2, "barh", "Language and title distribution", "Name of title", (15,40), 15, 8)

Some notable exceptions from the average when looking at programming language usage by title are statisticians' use of R and developer relations/advocacy's use of C++. Remember from the dataframe I showed above, though, that the number of answers for both of these is quite low, so we can't really draw good conclusions from this. But at least for statisticians it makes sense that R is the most commonly used language.
Also, for a title like software engineer it makes sense that Java, C++ and JavaScript, for example, are more common than for the rest of the titles.

Language and Experience distribution

In [None]:
q1 = "For how many years have you been writing code and/or programming?"
q2 = "What programming languages do you use on a regular basis? (Select all that apply)"
plot_related(q1, q2, "barh", "Language and Experience Distribution", "Years of experience", (15,30), 10, 8)

Looking at language usage by experience, it follows a similar pattern. Noticeably, though, the usage of older languages such as Bash increases with experience, probably because these languages were more prevalent when those participants started out and they have stuck with them since.
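
To compare language usage between experience groups of different sizes, the counts can be normalized within each group. This is a minimal sketch using `pd.crosstab` on made-up rows. The column names `Experience` and `Language` are chosen for this example only.

```python
import pandas as pd

# Toy data: one row per (respondent's experience group, language used)
toy = pd.DataFrame({
    "Experience": ["< 1 years", "< 1 years", "20+ years", "20+ years", "20+ years", "20+ years"],
    "Language": ["Python", "Python", "Python", "Bash", "Bash", "SQL"],
})

# normalize="index" makes every row (experience group) sum to 1
within_group = pd.crosstab(toy["Experience"], toy["Language"], normalize="index") * 100
print(within_group.round(1))
```

Because each row sums to 100%, a larger Bash share in the more experienced group reflects a real difference in usage rather than a difference in group size.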

Recommended programming language

In [None]:
q = "What programming language would you recommend an aspiring data scientist to learn first?"
plot_unrelated(q, "barh", q, "Name of language", (15,10), 10, 8)

To almost no surprise, Python is the clear favorite to learn first for data science and machine learning, with R and SQL second and third respectively, at roughly 1/16 of Python's number of answers.
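
A ratio like the 1/16 above can be obtained by dividing every count by the count of the top answer. The numbers below are made up so that the ratio comes out at exactly 1/16; they are not the survey counts.

```python
import pandas as pd

# Hypothetical recommendation counts (made-up numbers)
counts = pd.Series({"Python": 1600, "R": 100, "SQL": 100})

# Each answer's count relative to the most recommended language
ratio_to_top = counts / counts.max()
print(ratio_to_top)
```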

Usage of IDE

In [None]:
q = "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply)"
plot_unrelated(q, "barh", q, "Name of IDE", (15,10), 10, 8)

As seen above, Jupyter Notebook is the most commonly used IDE among participants at 25%, with VSCode second at 15%. Apart from "None" and "Other", the least used one is Vim/Emacs, which makes sense considering their age.

IDE vs Experience

In [None]:
q1 = "For how many years have you been writing code and/or programming?"
q2 = "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply)"
plot_related(q1, q2, "barh", "IDE and Experience distribution", "Years of Experience", (15,30), 15, 8)

Notably, following the same trend as with experience and language usage, the popularity of Vim/Emacs increases the more experience the participants have.

CONCLUSIONS

We've looked at some questions from the 2021 Kaggle data science and machine learning survey to draw some conclusions about the participants and their work.

India was the most represented country in the survey with 29% of the answers.
A majority of the participants were 18-29 years old; 80% were male and 19% female.

Almost 90% had attained, or planned to attain within the next 2 years, at least a bachelor's degree. 26% of the participants were students.
A bachelor's or a master's degree was the most common across the job titles, with the exception of research scientist, where master's and doctoral degrees were the most common.

A majority of participants had less than 3 years of coding experience.
The title with the highest average coding experience was research scientist at 8.5 years and, apart from students and participants who are currently not employed, the lowest was business analyst at 3.7 years.
We also found that Python dominates both in usage and as by far the most recommended first language for an aspiring data scientist, which feels good considering it's what we're learning today.
We also found that the more experience participants have, the more likely they are to use older languages such as Bash and IDEs such as Vim/Emacs, probably because these tools were more prevalent when they started learning.
The distribution of languages was similar across titles, with Python and SQL being the most common, with the exception of statisticians, whose most used language was R. Software engineers also used C++, Java and JavaScript more than other titles.
The most commonly used IDE was Jupyter Notebook.

For future work, it would be useful to look at older editions of the same survey for differences in these answers over the years. Something to look for in those surveys is whether a bigger shift has happened as the field has grown. Was Python always dominating the field, or when did it rise to the top? Has the experience level of the field grown or lessened as the years have gone by?
