# Python for data analysis and visualisation

## Online short course at the University of St Andrews

## Final assignment

### [Ryan Stuart]

## Introduction
All of the basic requirements have been met, together with the easier additional requirements. The only problem encountered was in creating the function `check_values_of_variables_are_admissible`. It was overcome by converting the column dtypes to string. All of the source code in the refinedata script is my own, apart from using [stackoverflow](https://stackoverflow.com/questions/37769435/understanding-python-map-function-range) to understand the `map` function.

The reproducibility and reusability of my analysis should be of a high standard, as the code is organised into well-defined functions with docstrings and comments. Additionally, Git version control was used to manage and document changes, and the repository contains a README.txt with instructions to reproduce my analysis.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('../data/Refined_Scotland_teaching_file_1PCT.csv')

In [None]:
df

I wrote the code that checks whether the values of the variables are in the expected format myself. The code that checks their admissibility was written after reading through other people's work and suggestions on [stackoverflow](https://stackoverflow.com/). This was necessary because some of the variables have an object dtype, and comparing them raised `TypeError: '>=' not supported between instances of 'str' and 'int'`, as stated above in the Introduction.
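The snippet below is a minimal sketch of the string-conversion approach, not the actual code from the refinedata script. The function name `find_inadmissible`, the `ADMISSIBLE` dictionary and the sample rows are assumptions made for illustration. The two sets shown are simply the values that appear for those columns in the refined dataset.

```python
import pandas as pd

# Hypothetical admissible values for two columns (assumption for illustration).
ADMISSIBLE = {
    "RESIDENCE_TYPE": {"C", "P"},
    "Family_Composition": {"0", "1", "2", "3", "4", "5", "X"},
}


def find_inadmissible(df, admissible):
    """Return {column: sorted list of values outside the admissible set}."""
    problems = {}
    for column, allowed in admissible.items():
        # Cast to string so that 1 and "1" are treated alike; this avoids
        # comparing str with int, which is what raises the TypeError.
        values = set(df[column].astype(str).unique())
        bad = sorted(values - allowed)
        if bad:
            problems[column] = bad
    return problems


# Made-up rows: "Z" and 7 are deliberately inadmissible.
sample = pd.DataFrame({
    "RESIDENCE_TYPE": ["P", "C", "Z"],
    "Family_Composition": [1, "X", 7],
})
print(find_inadmissible(sample, ADMISSIBLE))
# {'RESIDENCE_TYPE': ['Z'], 'Family_Composition': ['7']}
```

Checking set membership on strings sidesteps ordering comparisons such as `>=` altogether, so mixed object columns no longer matter.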

## Perform the descriptive analysis of the dataset

In [None]:
def check_format(df):
    """
    The check_format function reports the dtype of each column.
    """
    for column in df.columns:
        if df[column].dtype == 'int64':
            print(f"{column}: integer")
        elif df[column].dtype == 'object':
            print(f"{column}: object")
        else:
            print(f"{column}: {df[column].dtype}")

check_format(df)

- The total number of records in the refined dataset is: 63388
- All the different values for each variable and their occurrence in **`DESCENDING ORDER`** (from left to right) in the refined dataset are:

`RESIDENCE_TYPE:` 
**P:**      62239
**C:**      1149

`Family_Composition:`
**1:**    33337
**0:**    11716
**4:**     7757
**2:**    7660
**X:**     1149
**3:**     1019
**5:**      750

`sex:`
**2:**    32696
**1:**    30692

`age:`
**1:**    10980
**5:**     9336
**4:**     8963
**3:**     8056
**6:**     7854
**2:**     7541
**7:**     5731
**8:**     4927

`Marital_Status:`
**1:**    29611
**2:**    23918
**4:**     4159
**5:**     4032
**3:**     1668

`student:`
**2:**    51397
**1:**    11991

`Country_Of_Birth:`
**1:**    59045
**2:**     4343

`health:`
**1:**    33436
**2:**    18825
**3:**     7544
**4:**     2759
**5:**      824

`Ethnic_Group:`
**1:**    60901
**3:**     1667
**4:**      376
**2:**      199
**6:**      165
**5:**       80

`religion:`
**2:**    34199
**1:**    23309
**9:**     4335
**6:**      906
**4:**      187
**8:**      162
**7:**      116
**3:**      113
**5:**       61

`Economic_Activity:`
**1:**    25350
**5:**    11527
**X:**    10980
**2:**     3623
**6:**     2691
**8:**     2543
**3:**     2183
**4:**     1743
**7:**     1741
**9:**     1007

`Occupation:`
**X:**    14435
**9:**     7256
**2:**     7237
**5:**     6140
**4:**     6010
**3:**     5015
**7:**     4934
**6:**     4484
**8:**     4327
**1:**     3550

`industry:`
**X:**     14435
**4:**      7557
**11:**     6817
**2:**      6318
**8:**      4818
**10:**     4123
**3:**      3876
**6:**      3754
**5:**      3309
**9:**      3225
**7:**      1979
**13:**     1172
**12:**     1107
**1:**      898

`Hours_Worked_Per_Week:`
**X:**    32851
**3:**    18333
**2:**     6518
**4:**     3543
**1:**     2143

`Approximate_Social_Grade:`
**2:**   15607
**4:**    14709
**X:**    12090
**3:**    11602
**1:**     9380

## Minimal requirements, build the following plots:

### Bar chart for the number of records for each age group
The statement below applies to both bar graphs regarding age and occupation.

The majority of the code below is my own, apart from annotating the bars with their values. This was achieved by researching examples on [stackoverflow](https://stackoverflow.com/questions/25447700/annotate-bars-with-values-on-pandas-bar-plots).

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Import refined dataset for analysis
data = pd.read_csv('../data/Refined_Scotland_teaching_file_1PCT.csv')
df = pd.DataFrame(data)

# format graph
with plt.style.context('fivethirtyeight'):
    ax = df['age'].value_counts().plot(kind='barh')
# barh draws horizontal bars, so the counts run along the x axis
plt.xlabel('Number of records')
plt.ylabel('age')
plt.title('Number of records in each age group: census 2011')

# Annotate the bars in bar graph
for p in ax.patches:
    ax.annotate(str(p.get_width()), (p.get_x() + p.get_width(), p.get_y()), xytext=(5, 10), textcoords='offset points')
    
plt.show()

### Bar chart for the number of records for each occupation

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Import refined dataset for analysis
data = pd.read_csv('../data/Refined_Scotland_teaching_file_1PCT.csv')
df = pd.DataFrame(data)

# format graph
with plt.style.context('fivethirtyeight'):
    ax = df['Occupation'].value_counts().plot(kind='barh')
# barh draws horizontal bars, so the counts run along the x axis
plt.xlabel('Number of records')
plt.ylabel('Occupation')
plt.title('Number of records in each occupation: census 2011')

# Annotate the bars in bar graph
for p in ax.patches:
    ax.annotate(str(p.get_width()), (p.get_x() + p.get_width(), p.get_y()), xytext=(5, 10), textcoords='offset points')
    
plt.show()
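
As an alternative to looping over `ax.patches`, Matplotlib 3.4 and later provide `Axes.bar_label`, which writes each bar's value at its end. The sketch below uses made-up data in place of the census columns. It assumes Matplotlib 3.4 or newer and is not the approach used in the cells above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values standing in for the 'age' column (assumption for illustration).
sample = pd.Series([1, 1, 1, 2, 2, 3], name="age")
counts = sample.value_counts()

fig, ax = plt.subplots()
# Horizontal bars: categories on the y axis, counts on the x axis.
bars = ax.barh(counts.index.astype(str), counts.values)
# bar_label returns the text objects it adds, one per bar.
labels = ax.bar_label(bars, padding=3)
ax.set_xlabel("Number of records")
ax.set_ylabel("age")
ax.set_title("Number of records in each age group (toy data)")
```

Calling `plt.show()` afterwards displays the figure in a notebook.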

## Additional Requirements
I used the link from the assignment pdf for the [Matplotlib library](http://matplotlib.org/) to help gain insight and improve my data visualisation abilities.

### Pie chart for the percentage of records for each general health descriptor

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Import refined dataset for analysis
data = pd.read_csv('../data/Refined_Scotland_teaching_file_1PCT.csv')
df = pd.DataFrame(data)

# Create and format pie chart
with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(8, 8))
    # Plot inside the context so the style also applies to the pie itself
    df['health'].value_counts().plot(kind='pie', autopct='%1.2f%%')

plt.title('The percentage of records for each general health descriptor')
# y label = "" to remove count from the y axis
plt.ylabel("")
plt.legend()

plt.show()

### Pie chart for the percentage of records for each ethnic group
Using the [matplotlib](https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html) link provided in the final assignment documentation, I researched how to place labels outside the chart, because the values were overlapping and unreadable. I adapted the code from the linked example to suit my data visualisation.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Import refined dataset for analysis
data = pd.read_csv('../data/Refined_Scotland_teaching_file_1PCT.csv')

# Count each ethnic group once so wedges, labels and legend share one order
counts = data['Ethnic_Group'].value_counts()
shares = counts / counts.sum() * 100

# Create the pie chart
plt.figure(figsize=(6, 8))
wedges, _ = plt.pie(counts, wedgeprops=dict(width=0.5), startangle=-40)

# Annotate wedges with the group code and its percentage
for i, p in enumerate(wedges):
    ang = (p.theta2 - p.theta1) / 2. + p.theta1
    y = np.sin(np.deg2rad(ang))
    x = np.cos(np.deg2rad(ang))
    horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
    plt.annotate(f'{counts.index[i]} ({shares.iloc[i]:.2f}%)',
                 xy=(x, y), xytext=(1.35 * np.sign(x), 1.5 * y + i * 0.05),
                 horizontalalignment=horizontalalignment, arrowprops=dict(arrowstyle="-"))

# Add legend (in the same order as the wedges) and title
plt.legend(wedges, counts.index, title="Ethnic Group", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
plt.title('Percentage of records for each ethnic group')

plt.show()