### LSE Data Analytics Online Career Accelerator

# Course 2: Data Analytics using Python

## Assignment: Diagnostic Analysis using Python

You’ll be working with real-world data to address a problem faced by the National Health Service (NHS). The analysis will require you to utilise Python to explore the available data, create visualisations to identify trends, and extract meaningful insights to inform decision-making. 


### Prepare your workstation

# Import the necessary libraries.
import pandas as pd
import numpy as np

# Optional - Ignore warnings.
import warnings
warnings.filterwarnings('ignore')

# Import and sense-check the actual_duration.csv data set as ad.

ad = pd.read_csv('actual_duration.csv')

# View the DataFrame.

ad.head()

# Determine whether there are missing values.

# Count the missing values in each column (all zeros means no gaps).
ad.isnull().sum()

# Determine the metadata of the data set. 

ad.info()

# Determine the descriptive statistics of the data set.

ad.describe()


# Import and sense-check the appointments_regional.csv data set as ar.

ar = pd.read_csv('appointments_regional.csv')

# View the DataFrame.

ar



# Determine the metadata of the data set.

ar.info()


# Determine the descriptive statistics of the data set.

ar.describe()



# Determine whether there are missing values.

# Count the missing values in each column (all zeros means no gaps).
ar.isnull().sum()

# Import and sense-check the national_categories data set as nc (loaded here from a CSV version of the file).

nc = pd.read_csv('national_categories_CSV_Version.csv')

# View the DataFrame.

nc 


# Determine whether there are missing values.

# Count the missing values in each column (all zeros means no gaps).
nc.isnull().sum()

# Determine the metadata of the data set.

nc.info()


# Determine the descriptive statistics of the data set.

nc.describe()


### Explore the data set

**Question 1:** How many locations are there in the data set?

# Determine the number of locations.

# Count the distinct location codes.
ad['sub_icb_location_code'].nunique()

# Answer = 106

**Question 2:** What are the five locations with the highest number of records?



# Determine the top five locations based on record count.

# value_counts() counts the rows (records) for each location and sorts them from highest to lowest.
# Summing count_of_appointments would rank locations by appointments, not by records.

Top_5_Locations = ad['sub_icb_location_code'].value_counts().head()

Top_5_Locations

**Question 3:** How many service settings, context types, national categories, and appointment statuses are there?

# Determine the number of service settings.

nc.groupby('service_setting').count()

#Answer = 5 

# Determine the number of context types.

nc.groupby('context_type').count()

#Answer = 3 


# Determine the number of national categories.

nat_cat_count_table = nc.groupby('national_category').count()

# In the other questions the tables were short enough to read the answer straight from the output.
# This one is longer, so the code below counts the rows of the grouped table (one row per national category).

nat_cat_count_table[nat_cat_count_table.columns[0]].count()

# Answer = 18

# Determine the number of appointment status.

ar.groupby('appointment_status').count() 

#Answer = 3 
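
The counts in Question 3 can also be read off directly, without scanning grouped tables. The sketch below uses a small made-up DataFrame (the rows are illustrative, not taken from the NHS files) and `nunique()` to count the distinct values in each column.

```python
import pandas as pd

# Toy data for illustration only; not the real NHS records.
toy = pd.DataFrame({
    'service_setting': ['General Practice', 'Other', 'General Practice', 'Unmapped'],
    'context_type': ['Care Related Encounter', 'Unmapped', 'Care Related Encounter', 'Unmapped'],
})

# nunique() returns the number of distinct values in each selected column.
distinct_counts = toy[['service_setting', 'context_type']].nunique()
print(distinct_counts)
```

Applied to the real data, the same call on `nc` or `ar` should give one number per column, which is easier to quote as an answer than a full grouped table.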



# Assignment activity 3

### Continue to explore the data and search for answers to more specific questions posed by the NHS.

**Question 1:** Between what dates were appointments scheduled? 

# Between what dates were appointments scheduled?

# Answer: 2021-08-01 to 2022-06-30

# Methodology

# Import classes.
# (A plain 'import datetime' is not needed: the line below would immediately shadow it.)

from datetime import datetime, date

# Sense-check the DataFrame with dtypes and info().

nc.dtypes

nc.info()

# Change the appointment_date column to a datetime format.

nc['appointment_date'] = pd.to_datetime(nc['appointment_date'])

# Determine the earliest and latest dates.

print(nc['appointment_date'].min(), nc['appointment_date'].max())

**Question 2:** Which service setting was the most popular for NHS North West London from 1 January to 1 June 2022?

# Which service setting reported the most appointments in North West London from 1 January to 1 June 2022?

# Answer = General Practice

# Methodology

# Create a subset of nc with only North West London records dated 1 January to 1 June 2022 (inclusive).

nc_subset = nc[(nc['sub_icb_location_name'] == 'NHS North West London ICB - W2U3Z')
               & nc['appointment_date'].between('2022-01-01', '2022-06-01')]

# Group the subset by service setting and total the appointments.
# Selecting count_of_appointments first avoids trying to sum the date and text columns.

nc_subset3 = nc_subset.groupby('service_setting')[['count_of_appointments']].sum()

# Sort the totals from highest to lowest.

nc_subset3.sort_values(by='count_of_appointments', ascending=False)


**Question 3:** Which month had the highest number of appointments?

# Which month had the highest number of appointments?

# Answer: November 2021

# Methodology:

# Group nc by month and total the appointments in each month.
# Selecting count_of_appointments first avoids trying to sum the date and text columns.

nc_subset5 = nc.groupby('appointment_month')[['count_of_appointments']].sum()

# Sort from the highest to the lowest monthly total.

nc_subset5.sort_values(by='count_of_appointments', ascending=False)

**Question 4:** What was the total number of records per month?

# Total number of records per month.

# count() tallies the rows (records) in each month, not the number of appointments.

records_per_month = nc.groupby('appointment_month').count()

records_per_month['count_of_appointments']
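
Questions 3 and 4 rely on the difference between the number of records (rows) and the number of appointments (the sum of `count_of_appointments`). The following is a minimal sketch of that distinction on made-up numbers; the toy DataFrame is an assumption for illustration, not NHS data.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'appointment_month': ['2021-08', '2021-08', '2021-09'],
    'count_of_appointments': [10, 5, 7],
})

grouped = toy.groupby('appointment_month')['count_of_appointments']

records = grouped.size()        # number of rows per month
appointments = grouped.sum()    # number of appointments per month

print(pd.DataFrame({'records': records, 'appointments': appointments}))
```

Here August has two records but fifteen appointments, which is why the two questions need different aggregations.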



# Assignment activity 4

### Create visualisations and identify possible monthly and seasonal trends in the data.

# Import the necessary libraries.
import seaborn as sns
import matplotlib.pyplot as plt

# Set figure size.
sns.set(rc={'figure.figsize':(15, 12)})

# Set the plot style as white.
sns.set_style('white')

### Objective 1
Create three visualisations indicating the number of appointments per month for service settings, context types, and national categories.

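# Aggregate appointments per month by service setting.
# Selecting count_of_appointments first avoids trying to sum the date and text columns.

aa41 = nc.groupby(['service_setting', 'appointment_month'])[['count_of_appointments']].sum()

# Exclude General Practice.

aa411 = aa41.drop('General Practice', axis=0)

# View output.

aa411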

**Service settings:**

# Make a lineplot of the number of appointments per month for each service setting.

LP1 = sns.lineplot(data=aa411, x="appointment_month", y="count_of_appointments", hue="service_setting").set(title='Appointments Per Month By Service Setting')

_ = plt.setp(plt.xticks()[1], rotation=45)

plt.savefig("Appointments Per Month By Service Setting")



**Context types:**

# Make a lineplot of the number of appointments per month for each context type.

# Make a separate working data set (aa42).

aa42 = nc.groupby(['context_type', 'appointment_month'])[['count_of_appointments']].sum()

# Show the data set.

aa42

# Create the lineplot.

LP2 = sns.lineplot(data=aa42, x="appointment_month", y="count_of_appointments", hue="context_type").set(title='Appointments Per Month By Context Type')

_ = plt.setp(plt.xticks()[1], rotation=45)

# Save under its own name so the service setting figure is not overwritten.

plt.savefig("Appointments Per Month By Context Type")


**National categories:**

# Create a working data set.

aa43 = nc.groupby(['national_category', 'appointment_month'])[['count_of_appointments']].sum()

# Show the data set.

aa43

# Make a lineplot of the number of appointments per month for each national category.

# Set the figure size before plotting; changing it afterwards does not resize a figure that already exists.

plt.rcParams['figure.figsize'] = (40, 40)

LP3 = sns.lineplot(data=aa43, x="appointment_month", y="count_of_appointments", hue="national_category").set(title='Appointments Per Month By NC')

_ = plt.setp(plt.xticks()[1], rotation=45)

plt.savefig("Appointments Per Month By NC")






### Objective 2
Create four visualisations indicating the number of appointments for service setting per season. The seasons are summer (August 2021), autumn (October 2021), winter (January 2022), and spring (April 2022).

**Summer (August 2021):**

# Look at August 2021 in more detail.

# Reuse the service setting DataFrame (aa411) from Objective 1, regrouped so that the month comes first in the index.

aa411_to_group = aa411.groupby(['appointment_month', 'service_setting']).sum()

# View output.

aa411_to_group

summer_2021 = aa411_to_group.loc['2021-08'].reset_index()

summer_2021

# Create a barplot.

sns.barplot(data=summer_2021, x='service_setting', y='count_of_appointments').set(title='Summer 2021 Service Settings')

plt.savefig("Summer 2021 Service Settings")







**Autumn (October 2021):**

# Look at October 2021 in more detail.

autumn_2021 = aa411_to_group.loc['2021-10'].reset_index()

autumn_2021

# Create a barplot.

sns.barplot(data=autumn_2021, x='service_setting', y='count_of_appointments').set(title='Autumn 2021 Service Settings')

plt.savefig("Autumn 2021 Service Settings")


**Winter (January 2022):**

# Look at January 2022 in more detail.

winter_2022 = aa411_to_group.loc['2022-01'].reset_index()

winter_2022

# Create a barplot.

sns.barplot(data=winter_2022, x='service_setting', y='count_of_appointments').set(title='Winter 2022 Service Settings')

plt.savefig("Winter 2022 Service Settings")



**Spring (April 2022):**

# Look at April 2022 in more detail.

spring_2022 = aa411_to_group.loc['2022-04'].reset_index()

spring_2022

# Create a barplot.

sns.barplot(data=spring_2022, x='service_setting', y='count_of_appointments').set(title='Spring 2022 Service Settings')

plt.savefig("Spring 2022 Service Settings")
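
The four seasonal cells repeat the same select-and-reset steps with a different month each time. As a rough sketch, a small helper could return the plot-ready rows for any month from a table indexed by month and service setting. The name `month_slice` and the toy numbers are assumptions for illustration, not part of the assignment.

```python
import pandas as pd

def month_slice(grouped, month):
    """Return the rows for one month as a flat DataFrame (hypothetical helper)."""
    # xs() selects one value of the 'appointment_month' index level and drops that level.
    return grouped.xs(month, level='appointment_month').reset_index()

# Toy data for illustration only, indexed like the month-first grouped table.
toy = pd.DataFrame({
    'appointment_month': ['2021-08', '2021-08', '2021-10'],
    'service_setting': ['Other', 'Unmapped', 'Other'],
    'count_of_appointments': [3, 2, 5],
}).set_index(['appointment_month', 'service_setting'])

# Map each season label to the month that represents it.
seasons = {'Summer 2021': '2021-08', 'Autumn 2021': '2021-10'}
slices = {name: month_slice(toy, month) for name, month in seasons.items()}

print(slices['Summer 2021'])
```

Each entry in `slices` has `service_setting` and `count_of_appointments` columns, so it could be handed to `sns.barplot` with the season label as the title.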




# Assignment activity 5

### Analyse tweets from Twitter with hashtags related to healthcare in the UK.

# Libraries and settings needed for analysis
import pandas as pd
import seaborn as sns

# Set figure size.
sns.set(rc={'figure.figsize':(15, 12)})

# Set the plot style as white.
sns.set_style('white')

# Maximum column width to display
pd.options.display.max_colwidth = 200

# Load the data set.

tweets = pd.read_csv('tweets.csv')

# View the DataFrame.

tweets




# Explore the data set.


tweets.describe()




tweets.info()

# Do you think it is useful to look at these columns in more detail? Explain your answer.

# Yes. So far we only have a descending list of favourite and retweet counts per tweet.

# We have little understanding of any link between the content of a tweet and its favourite or retweet count.

# Nor have we looked at other possible factors, such as time of day or location.








# Create a DataFrame for the hashtag values.

tweets_text = tweets[["tweet_entities_hashtags"]]

tweets_text

# Create a variable (tags) and assign an empty list to it.

tags = []

# split() with no argument splits on any whitespace, including newlines,
# so a hashtag is not glued to the preceding word (e.g. 'drugs\n\n#tipsfornewdocs').
for y in [x.split() for x in tweets['tweet_full_text'].values]:
    for z in y:
        if '#' in z:
            # Change to lowercase.
            tags.append(z.lower())
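
Splitting on whitespace can still leave punctuation attached to a hashtag. One possible alternative, sketched below on two made-up tweets, is to pull the hashtags out with a regular expression and count them with `collections.Counter`. The sample text and variable names are assumptions for illustration only.

```python
import re
from collections import Counter

# Made-up tweets for illustration only.
sample_tweets = [
    'New drugs\n\n#TipsForNewDocs #Healthcare',
    'Great day at work #healthcare',
]

# '#\w+' matches a '#' followed by letters, digits or underscores,
# so newlines and trailing punctuation are left out of the tag.
sample_tags = [tag.lower() for text in sample_tweets for tag in re.findall(r'#\w+', text)]

tag_counts = Counter(sample_tags)
print(tag_counts.most_common())
```

The trade-off is that `\w` stops at characters such as hyphens, so a tag containing one would be cut short.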

# Convert the list into a DataFrame (e.g. data). Remember to reset the index.

data = pd.DataFrame(tags).reset_index()

data

# Count the occurrences of each hashtag.

data_renamed = data.rename(columns={'index': 'count', 0: 'word'})

data_renamed = data_renamed.groupby(['word']).count().reset_index()

# Sort the counts in descending order.

counted = data_renamed.sort_values(['count'], ascending=False)

# Keep hashtags that appear more than 11 times, and drop the healthcare hashtag with the count < 100 filter.
# NOTE: I chose 11 instead of 10 because it cleans up the resulting barplot, making it more readable while still showing the trends related to healthcare.

counted = counted[counted['count'] > 11]

counted_final = counted[counted['count'] < 100]

counted_final

#Drop"drugs\n\n#tipsfornewdocs"



#Create a  Barplot Showing Trending Hashtags 

plt.rcParams['figure.figsize']=(20,20)


sns.barplot(y ='word', x='count',data=counted_final).set(title='Trending Topics on Twitter')

_ = plt.setp(plt.xticks()[1], rotation=45)

sns.set(font_scale=2)

plt.savefig("Healthcare -  A Trending Topic")




# Assignment activity 6

### Investigate the main concerns posed by the NHS. 

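# Prepare your workstation.
# The appointments_regional.csv file was loaded earlier as ar.

# Create an aggregated data set: total appointments for each combination of the other features.
# (Grouping by count_of_appointments itself would not give meaningful totals.)

ar_agg = ar[['count_of_appointments', 'hcp_type', 'appointment_status', 'appointment_mode',
             'time_between_book_and_appointment', 'appointment_month']]

ar_agg_sum = ar_agg.groupby(['appointment_month', 'hcp_type', 'appointment_status', 'appointment_mode',
                             'time_between_book_and_appointment'])['count_of_appointments'].sum().reset_index()

# Renamed the DataFrame to a clearer name for further work.

ar_appnmnts_summed = ar_agg_sum

# View the DataFrame.

ar_appnmnts_summed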
# View the DataFrame.


# Create a new DataFrame (e.g. ar_df) with the appointment_month and count_of_appointments columns.

ar_df = ar[["appointment_month", "count_of_appointments"]]

# Use the groupby() function to determine the total number of appointments per month. Remember to reset the index.

ar_df_2 = ar_df.groupby(['appointment_month']).agg('sum').reset_index()

ar_df_2

# Add a utilisation column: the monthly total divided by 30 gives an average daily value,
# rounded to one decimal place.
# For reference, the NHS can accommodate a maximum of 1,200,000 appointments per day.

ar_df_2["utilisation"] = (ar_df_2["count_of_appointments"] / 30).round(1)

# Renamed the DataFrame to a clearer name for further work.

ar_apt_count_vs_utilisation = ar_df_2

# View the DataFrame.

ar_apt_count_vs_utilisation
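
The `utilisation` column holds an average number of appointments per day, which is easier to judge when set against the stated maximum of 1,200,000 appointments per day. Below is a minimal sketch of that comparison on made-up monthly totals. The column name `capacity_share` and the toy figures are assumptions for illustration, not results from the NHS data.

```python
import pandas as pd

# Maximum appointments per day, as stated in the assignment brief.
DAILY_CAPACITY = 1_200_000

# Toy monthly totals for illustration only.
toy = pd.DataFrame({
    'appointment_month': ['2021-08', '2021-09'],
    'count_of_appointments': [24_000_000, 30_000_000],
})

# Average daily appointments, using the 30-day approximation from the brief.
toy['utilisation'] = (toy['count_of_appointments'] / 30).round(1)

# Share of the daily capacity in use (hypothetical extra column).
toy['capacity_share'] = (toy['utilisation'] / DAILY_CAPACITY).round(2)

print(toy)
```

A share close to or above 1.0 would suggest the network is running at or beyond its stated capacity, which bears directly on the staffing question.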

**Question 1:** Should the NHS start looking at increasing staff levels? 

#Create two lineplots with Seaborn:
#Change the datatype of appointment_month for both DataFrames to string for ease of visualisation.

ar_apt_count_vs_utilisation['appointment_month'] = ar_apt_count_vs_utilisation['appointment_month'].astype("string")

#ar_appnmnts_summed['appointment_month'] = ar_appnmnts_summed['appointment_month'].astype("string")

ar_apt_count_vs_utilisation['month_names'] = pd.to_datetime(ar_apt_count_vs_utilisation['appointment_month'], format = '%Y-%m').dt.month_name()

ar_apt_count_vs_utilisation



#Create a lineplot indicating the number of monthly visits. 

#NOTE: I have interpreted the ambiguous 'visits' as 'appointments'. For Information on attended appointments, see below. 

plt.rcParams['figure.figsize']=(10,10)

sns.set(font_scale=1)

sns.lineplot(data = ar_apt_count_vs_utilisation, x = 'appointment_month' , y = 'count_of_appointments').set(title='Appointments Per Month')

_ = plt.setp(plt.xticks()[1], rotation=45)








#Create a lineplot indicating the monthly capacity utilisation

plt.rcParams['figure.figsize']=(10,10)

sns.set(font_scale=1)

sns.lineplot(data = ar_apt_count_vs_utilisation, x = 'appointment_month' , y = 'utilisation').set(title='Utilisation Per Month')

_ = plt.setp(plt.xticks()[1], rotation=45)






#CONCLUSIONS

**Question 2:** How do the healthcare professional types differ over time?

# NOTE: Healthcare professional type is displayed in the data set as hcp_type.

# Make a DataFrame with the relevant columns from ar.

ar_hcp = ar[["appointment_month", "hcp_type", "count_of_appointments"]]

# Total the appointments for each month and healthcare professional type.

ar_hcp_2 = ar_hcp.groupby(['appointment_month', 'hcp_type']).agg({'count_of_appointments': 'sum'})

# Sense-check the result.

ar_hcp_2.head()




#Make a LinePlot

plt.rcParams['figure.figsize']=(10,10)

sns.lineplot(data = ar_hcp_2, x = 'appointment_month' , y = 'count_of_appointments', hue = 'hcp_type').set(title='HCP Types Over Time')

_ = plt.setp(plt.xticks()[1], rotation=45)



**Question 3:** Are there significant changes in whether or not visits are attended?

# Make a working DataFrame for attended visits only and find the monthly average of attended visits.

ar_attendance = ar.groupby(['appointment_month', 'appointment_status'], as_index=False).agg({'count_of_appointments': 'sum'})

ar_att_2 = ar_attendance.loc[ar_attendance['appointment_status'] == 'Attended']

# Get the mean for attendance, computed from the data rather than typed in by hand.

attended_mean = ar_att_2['count_of_appointments'].mean()

attended_mean

# Make a lineplot and draw a horizontal line at the average appointment count across it.

plt.rcParams['figure.figsize'] = (10, 10)

sns.lineplot(data=ar_att_2, x='appointment_month', y='count_of_appointments').set(title='Attended Appointments Per Month vs Average')

_ = plt.setp(plt.xticks()[1], rotation=45)

plt.axhline(attended_mean, color='C1', linewidth=2)


# Make a working DataFrame and select unattended (DNA) visits.

ar_non_attendance = ar.groupby(['appointment_month', 'appointment_status'], as_index=False).agg({'count_of_appointments': 'sum'})

ar_non_att_2 = ar_non_attendance.loc[ar_non_attendance['appointment_status'] == 'DNA']

# Get the mean for non-attendance, computed from the data rather than typed in by hand.

dna_mean = ar_non_att_2['count_of_appointments'].mean()

dna_mean

# Make a lineplot and draw a horizontal line at the average DNA count across it.

plt.rcParams['figure.figsize'] = (10, 10)

sns.lineplot(data=ar_non_att_2, x='appointment_month', y='count_of_appointments').set(title='DNA Appointments vs Average')

_ = plt.setp(plt.xticks()[1], rotation=45)

plt.axhline(dna_mean, color='C1', linewidth=2)




# Make a DataFrame that includes both the DNA and Attended values.

ar_attendance_final = ar.groupby(['appointment_month', 'appointment_status'], as_index=False).agg({'count_of_appointments': 'sum'})

ar_att_final_2 = ar_attendance_final.loc[ar_attendance_final['appointment_status'].isin(['Attended', 'DNA'])]

# Make a lineplot that shows both values, with a dashed horizontal line at each monthly average.

plt.rcParams['figure.figsize'] = (10, 10)

sns.lineplot(data=ar_att_final_2, x='appointment_month', y='count_of_appointments', hue='appointment_status').set(title='Appointment Attendance By Month')

_ = plt.setp(plt.xticks()[1], rotation=45)

# Compute the two averages from the data rather than hard-coding them.
status_means = ar_att_final_2.groupby('appointment_status')['count_of_appointments'].mean()
plt.axhline(status_means['DNA'], color='grey', linestyle='--', linewidth=2)
plt.axhline(status_means['Attended'], color='grey', linestyle='--', linewidth=2)



          

**Question 4:** Are there changes in terms of appointment type and the busiest months?

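# Make a working DataFrame with the appointment mode, appointment status and HCP type.

ar_hcp_mode = ar[["appointment_month", "appointment_mode", "count_of_appointments", "appointment_status", "hcp_type"]]

# Total the appointments for each combination of month, mode, HCP type and status.
# (Group ar_hcp_mode, not ar_hcp, which lacks the appointment_mode and appointment_status columns.)

ar_hcp_2_mode = ar_hcp_mode.groupby(['appointment_month', 'appointment_mode', "hcp_type", "appointment_status"]).agg({'count_of_appointments': 'sum'})

# Sense-check the result.

ar_hcp_2_mode.head()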







#Make a LinePlot

plt.rcParams['figure.figsize']=(20,20)

sns.lineplot(data = ar_hcp_2_mode, x = 'appointment_month' , y = 'count_of_appointments', hue = 'appointment_mode',ci=None)

_ = plt.setp(plt.xticks()[1], rotation=45)






**Question 5:** Are there any trends in time between booking an appointment?

# Make a working DataFrame with the relevant columns from ar.

ar_timebook = ar[["appointment_month", "time_between_book_and_appointment", "count_of_appointments"]]

# Total the appointments for each month and each booking-to-appointment interval.

ar_timebook_2 = ar_timebook.groupby(['appointment_month', 'time_between_book_and_appointment']).agg({'count_of_appointments': 'sum'})

# Sense-check the result.

ar_timebook_2.head()


#Make a LinePlot 


plt.rcParams['figure.figsize']=(28,28)

sns.lineplot(data = ar_timebook_2, x = 'appointment_month' , y = 'count_of_appointments', hue = 'time_between_book_and_appointment',ci=None).set(title='Monthly Time Between Book & Appointment')

_ = plt.setp(plt.xticks()[1], rotation=45)




**Question 6:** How do the spread of service settings compare?

# How do the various service settings compare?

# View the national_categories DataFrame (nc) created in an earlier assignment activity.
# Create a new DataFrame that totals the number of appointments per month and service setting.
# View the DataFrame.
# Create a suitable visualisation in Seaborn based on the new DataFrame
# to indicate the number of appointments for each service setting.
# Create a second visualisation in Seaborn that concentrates on all the service settings excluding GP visits.

nc.head()

# Group by month and service setting only; count_of_appointments is the value being summed, not a grouping key.

nc_grp = nc.groupby(['appointment_month', 'service_setting'], as_index=False).agg({'count_of_appointments': 'sum'})

nc_grp




#Make a Lineplot

plt.rcParams['figure.figsize']=(20,20)

sns.lineplot(data = nc_grp, x = 'appointment_month' , y = 'count_of_appointments', hue = 'service_setting',ci=None).set(title='Service Setting by Month')

_ = plt.setp(plt.xticks()[1], rotation=45)

plt.savefig("All Service Setting by Month")





# Create a second visualisation in Seaborn that concentrates on all the service settings excluding GP visits.

# Total the appointments per month and service setting.

nc_no_gp = nc.groupby(['appointment_month', 'service_setting'], as_index=False).agg({'count_of_appointments': 'sum'})

# Exclude GP visits.

nc_no_gp_2 = nc_no_gp.loc[nc_no_gp['service_setting'] != 'General Practice']




plt.rcParams['figure.figsize']=(20,20)

sns.lineplot(data = nc_no_gp_2, x = 'appointment_month' , y = 'count_of_appointments', hue = 'service_setting',ci=None).set(title='Non-GP Service Setting by Month')

_ = plt.setp(plt.xticks()[1], rotation=45)


plt.savefig("Non-GP Service Setting by Month")






### Provide a summary of your findings and recommendations based on the analysis.

> Double click to insert your summary.