### Project 1 for Programming for Data Analysis

****

For this project you must create a data set by simulating a real-world phenomenon of
your choosing. You may pick any phenomenon you wish – you might pick one that is
of interest to you in your personal or professional life. Then, rather than collect data
related to the phenomenon, you should model and synthesise such data using Python.

****

In my approach to this project I am going to work with a data set that looks at the attrition levels in a company, to identify correlations and attempt to address any drivers. This is in line with some of the data sets I work with in my current career. Employee attrition has been on the rise, with companies now looking at how they can retain colleagues. I used a data set I took from Kaggle and amended it to reflect data more in line with what I see. [https://www.kaggle.com/datasets?search=employee&sort=votes]


###  Data Modelling
****
A population includes all of the elements from a data set, whereas a sample consists of one or more observations from the population. Samples are used to make statistical inferences about the population. Data modelling is a building block of Python. The process of creating data models using the syntax and environment of the Python programming language is called data modelling in Python. A data model is an abstraction that organises the different elements of the data and standardises the way they relate to one another. In simple words, data modelling in Python is the general process by which the language organises everything internally and how it treats and processes data. [https://hevodata.com/learn/data-modelling-in-python/]
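
As a rough illustration of the population and sample distinction, the sketch below treats a simulated array of 1,000 ages as the population and draws a random sample of 100 from it. The age values, the seed and the sample size are assumptions made for the example; they are not taken from the attrition data set.

```python
import numpy as np

# Reproducible random number generator (the seed is arbitrary).
rng = np.random.default_rng(42)

# Hypothetical "population": ages of 1,000 employees.
population = rng.normal(loc=36, scale=9, size=1000).clip(18, 65)

# A sample is a subset of the population, drawn here without replacement.
sample = rng.choice(population, size=100, replace=False)

# The sample mean serves as an estimate of the population mean.
print(f"Population mean age: {population.mean():.1f}")
print(f"Sample mean age:     {sample.mean():.1f}")
```

The two means will usually be close but not identical, which is the sense in which a sample lets us make an inference about the population.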

### Bivariate Analysis
****
Bivariate analysis is the analysis of two variables to determine the relationship between them. It helps to determine how much easier it becomes to predict a value for one variable if the value of the other variable is known. It contrasts with univariate analysis, in which only one variable is analysed. Both univariate analysis and bivariate analysis can be descriptive or inferential.

Bivariate analysis can be defined as the analysis of bivariate data. It is one of the simplest forms of statistical analysis, which is used to find out if there is a relationship between two sets of values. Usually, it involves the variables X and Y.

- Univariate analysis involves the analysis of one (“uni”) variable.
- Bivariate analysis involves the analysis of exactly two variables.
- Multivariate analysis involves the analysis of more than two variables.
[https://www.vedantu.com/maths/bivariate-analysis]
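
To make the univariate and bivariate distinction concrete, here is a minimal sketch using pandas. The data is simulated purely for illustration, and the column names `Age` and `YearsAtCompany` are assumed for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: age and years at the company for 200 employees.
age = rng.integers(20, 60, size=200)
# Tenure is built to grow loosely with age, plus random noise.
years_at_company = np.clip((age - 20) * 0.4 + rng.normal(0, 3, size=200), 0, None)

df = pd.DataFrame({"Age": age, "YearsAtCompany": years_at_company})

# Univariate: summarise one variable on its own.
print(df["Age"].describe())

# Bivariate: Pearson correlation between the two variables X and Y.
r = df["Age"].corr(df["YearsAtCompany"])
print(f"Pearson correlation: {r:.2f}")
```

A correlation close to 1 or -1 suggests that knowing one variable helps a lot in predicting the other, while a value near 0 suggests little linear relationship.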

### Synthesising an Employee Database
****

I downloaded the dataset as a comma separated values (csv) file which holds the data in a plain text list form. The Pandas library in Python allows us to read in the csv file for ease of data analysis. I saved the CSV file with the name Attrition. [https://www.w3schools.com/python/pandas/pandas_csv.asp] You can effectively and easily manipulate CSV files in Pandas using functions like `read_csv()` and `to_csv()`.

In my Python code file I first imported the requisite libraries and settings to allow me to synthesise the data in the file.

I imported libraries as follows:
- `import pandas as pd`
- `import numpy as np`
- `import matplotlib.pyplot as plt` for plotting
- `import sys`
- `import seaborn as sns`

To read in the file I used the following code: `attrition = pd.read_csv("HR-Employee-Attrition.csv")`. By default, the `read_csv()` method uses the first row of the CSV file as the column headings. Simple code allows us to easily find out details about the dataset. `attrition.head()` previews the first five rows, and `attrition.shape` confirms that the data has 1000 rows and 22 columns. We can also `print()` the dataset to determine which variables, i.e. column headings, we want to work with. [https://stackabuse.com/reading-and-writing-csv-files-in-python-with-pandas/]
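
The reading step can be tried without the real file. The sketch below uses a small in-memory CSV as a stand-in; its rows and its four column names are made up for illustration and are not the contents of the project file.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for the real file (values are made up).
csv_text = """Age,Attrition,Gender,YearsAtCompany
41,Yes,Female,6
49,No,Male,10
37,Yes,Male,0
33,No,Female,8
27,No,Male,2
32,No,Male,7
"""

# The first row becomes the column headings by default.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.head())            # first five rows
print(sample_df.shape)             # (rows, columns)
print(sample_df.columns.tolist())  # column headings
```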

Once imported, I was able to use various commands to draw either the whole data set or specific data, as specified by the column headings in the csv file. For example, `attrition.describe()` gives statistical insight such as the count, mean values and standard deviation, and `attrition.info()` prints information about the dataset including the index dtype and columns, non-null values and memory usage.

I then amended the code further so that the dataset summary is not shown when the program starts, but is output to analysis.txt. The function `summary_to_file()` creates the summary and writes it into the file at the same time.
`def summary_to_file(): sys.stdout = open("summary.txt","w")`. To output the summary into a file I used the sys module and its standard output stream `stdout`. The output gives various overviews of the dataset.
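
One possible way to write such a function is sketched below. It is a variation on the approach described above rather than the exact project code: it uses `contextlib.redirect_stdout` so that `stdout` is restored automatically afterwards, and the function arguments, the demo data and the temporary output path are assumptions made for the example.

```python
import contextlib
import os
import tempfile

import pandas as pd


def summary_to_file(df, path):
    """Write an overview of df to a text file instead of the screen."""
    with open(path, "w") as f, contextlib.redirect_stdout(f):
        print(df.describe())  # count, mean, std and quartiles
        df.info()             # dtypes, non-null counts, memory usage


# Small made-up frame so the example runs without the CSV file.
demo = pd.DataFrame({"Age": [41, 49, 37], "YearsAtCompany": [6, 10, 0]})

# Write into a temporary folder to avoid overwriting any existing file.
out_path = os.path.join(tempfile.mkdtemp(), "summary.txt")
summary_to_file(demo, out_path)

with open(out_path) as f:
    report = f.read()
print(report)
```

Because the redirection only lasts for the duration of the `with` block, later `print()` calls go back to the screen as normal.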

Following this initial analysis I started with a simple histogram and further amended the code to show four together selecting four different variables.

![Hist-2.png](attachment:Hist-2.png)


#### Histograms
****
We have a dataset with 22 variables over 1000 rows. The benefit of the histogram is that it allows us to view the data visually and take some high-level notes on specific variables. From the four histograms above we can quickly see that in the workforce we have:
- an attrition level of circa 18%
- an age distribution that is close to normal, with the majority of the workforce aged from their mid-twenties to mid-forties
- a male to female ratio of circa 42:58
- a right-skewed distribution of service, with the majority of colleagues having worked for the company for ten years or less

Very easily we have gained some interesting insights, but there are also disadvantages to relying solely on a histogram to answer our questions. Often we need to compare more than one variable at a time to get a full understanding. For example, we need to understand more about the 18% of leavers if we want to address attrition and return it to an acceptable industry standard of between 6% and 8%. [https://allthingsstatistics.com/miscellaneous/histogram-advantages-disadvantages/]
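
The four characteristics noted in this section could also be simulated with NumPy and plotted together in a two-by-two grid. The sketch below is only an illustration: the distributions chosen (normal for age, exponential for years of service) and all of their parameters are assumptions picked to roughly match the observations listed here, not values fitted to the real file.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000  # same number of rows as the data set described in this section

# Simulated workforce; every parameter here is an assumption.
demo = pd.DataFrame({
    "Attrition": rng.choice(["Yes", "No"], size=n, p=[0.18, 0.82]),
    "Age": rng.normal(36, 9, size=n).clip(18, 65).round(),
    "Gender": rng.choice(["Male", "Female"], size=n, p=[0.42, 0.58]),
    "YearsAtCompany": rng.exponential(7, size=n).round(),
})

# Check the simulated proportions against the targets.
print(demo["Attrition"].value_counts(normalize=True))
print(demo["Gender"].value_counts(normalize=True))

# Four plots together: bar counts for categories, histograms for numbers.
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for ax, column in zip(axes.flat, demo.columns):
    if column in ("Attrition", "Gender"):
        counts = demo[column].value_counts()
        ax.bar(counts.index.tolist(), counts.to_numpy())
    else:
        ax.hist(demo[column], bins=20)
    ax.set_title(column)
fig.tight_layout()
# In a notebook the figure renders automatically; in a script call plt.show().
```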


#### Scatter Plots
****
Scatter plots use dots to represent the relationship between variables. The `scatter()` method in the matplotlib library is used to draw a scatter plot. Scatter plots allow us to see the relationship amongst variables and how they may correlate with each other.

From the scatterplot below we have gained some further insights. We can assume attrition after 30 years of service is related to retirement. Interestingly, attrition is spread evenly over the first 20 years of service, with the exception of year 12. This tells us colleagues are likely to move early in their career, both as they start out and as they become established and more experienced. This can be of concern to a company where, as we saw from our histogram, the majority of colleagues have 10 years of service or less. These colleagues could potentially be most at risk of attrition.

`sns.set()`

`f = plt.figure(figsize=(6,3))`

`ax = sns.scatterplot(x="YearsAtCompany", y="Attrition", data=attrition)`

`plt.show()`

![scatterplot-2.png](attachment:scatterplot-2.png)

#libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt #for plotting
import sys
import seaborn as sns

attrition = pd.read_csv("HR-Employee-Attrition.csv")

#### Swarm Plot
****

A swarm plot is a type of scatter plot that is used for representing categorical values. It is very similar to the strip plot, but it avoids the overlapping of points. We can use seaborn, imported as `sns`, to create swarm plots. Focusing on the theme of attrition versus years of service, for continuity in modelling the data for this project, I was able to add an additional layer that I didn't have in the basic scatter plot by using the `hue` parameter, which visually highlights the males and females on the graph. As the data points don't overlap we are given a better representation of the distribution of values, but it does not scale well to large numbers of observations. This style of plot is sometimes called a “beeswarm”. [https://seaborn.pydata.org/generated/seaborn.swarmplot.html]

![swarmplot.png](attachment:swarmplot.png)
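
A swarm plot of this kind might be produced along the following lines. The data is simulated so that the example runs on its own; the proportions and the modest sample size are assumptions, and `Gender` is an assumed column name.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
n = 150  # swarm plots work best with modest numbers of points

# Simulated stand-in for the attrition data (all values are made up).
demo = pd.DataFrame({
    "Attrition": rng.choice(["Yes", "No"], size=n, p=[0.18, 0.82]),
    "Gender": rng.choice(["Male", "Female"], size=n, p=[0.42, 0.58]),
    "YearsAtCompany": rng.exponential(7, size=n),
})

fig, ax = plt.subplots(figsize=(6, 4))
# hue colours each point by gender, adding a third variable to the plot.
sns.swarmplot(data=demo, x="Attrition", y="YearsAtCompany", hue="Gender", ax=ax)
ax.set_title("Years at company by attrition (simulated)")
# In a notebook the figure renders automatically; in a script call plt.show().
```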

The Seaborn tutorial website recommends the use of a box plot or violin plot alongside a swarm plot. Given the shape of my graphed data I opted to use the violin plot. This can be mapped on top of the swarm plot or plotted separately. A violin plot shows the distribution of quantitative data across variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution. This can be an effective and attractive way to show multiple distributions of data at once, but Seaborn warns that the estimation procedure is influenced by the sample size, and violins for relatively small samples might look misleadingly smooth. [https://seaborn.pydata.org/generated/seaborn.violinplot.html] The peaks, valleys, and tails of each density curve can be compared visually to see where groups are similar or different.

![violinplot.png](attachment:violinplot.png)

#### Strip plot
A strip plot is a single-axis scatter plot that is used to visualise the distribution of many individual one-dimensional values. The values are plotted as dots along one unique axis, and the dots with the same value can overlap. To demonstrate the benefit of the strip plot I included the variable of the job title or position. [https://datavizproject.com/data-type/strip-plot/] Several strip plots are placed side by side to compare the distributions of data points amongst the values in the variable. The strip plot is good to use alongside the violin plot in cases where all observations are shown along with some representation of the underlying distribution.
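
As a sketch of how a strip plot can be layered on a violin plot, the example below draws both on the same axes. The job roles, the column name `JobRole` and the simulated years of service are all hypothetical and are used only so that the code runs on its own.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)
n = 200
roles = ["Sales", "Research", "HR"]  # hypothetical job titles

# Simulated stand-in data (all values are made up).
demo = pd.DataFrame({
    "JobRole": rng.choice(roles, size=n),
    "YearsAtCompany": rng.exponential(7, size=n),
})

fig, ax = plt.subplots(figsize=(7, 4))

# Violin: kernel density estimate of the distribution for each role.
sns.violinplot(data=demo, x="JobRole", y="YearsAtCompany", order=roles,
               inner=None, color="lightgrey", cut=0, ax=ax)

# Strip: every individual observation, jittered sideways to reduce overlap.
sns.stripplot(data=demo, x="JobRole", y="YearsAtCompany", order=roles,
              size=3, jitter=True, ax=ax)
# In a notebook the figure renders automatically; in a script call plt.show().
```

Passing the same `order` to both calls keeps the violins and the strips lined up category by category.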


****

#### Conclusion
The aim of this analysis was to describe the dataset using Python, in terms of its characteristics and the relationships between the variables.

#### Learnings
During this project, I got a high-level introduction to data analysis in Python and what can be achieved using the libraries built for this purpose. Statistical and data analytical skills are now in high demand. As I have an interest in, and work in, an area that involves collecting and analysing data, this project has bettered my understanding of the basic statistical concepts and how I can apply them to other data sets.

My background is in engineering and mathematics with minimal coding. A lot of the learnings I achieved would be considered basic to some, but they gave me a very good basis for navigating command environments and troubleshooting errors. For example, I had difficulty with this notebook not connecting to the kernel. I had to do a lot of research and troubleshooting, including using the Anaconda prompt to update Jupyter, to resolve the issue. These basics will be invaluable to me: whilst I currently work in Excel and Power BI, we are finding these programmes not as efficient and are migrating over to SQL, with a view to seeing if we can include Python in our workspace.

Machine learning is an important component of the growing field of data science. Through the use of statistical methods, algorithms are trained to make classifications or predictions, uncovering key insights within data mining projects. These insights will subsequently drive decision making in industry, in their aim to meet key growth metrics. As big data continues to expand and grow, the market demand for data scientists will increase, requiring them to assist in the identification of the most relevant business questions and subsequently the data to answer them.

#### Further References
- https://www.kaggle.com/datasets
- https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
- https://www.freecodecamp.org/news/how-to-write-a-good-readme-file/
- https://www.ibm.com/cloud/learn/machine-learning?msclkid=3efe2586ceb411ec87756ddd7a7b13ed
- https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet?msclkid=29ec18edcec111ecb98cc8dd61a84ff0

