## Assignment 4: Computational Code and Output
### Exploring Gender Diversity in Universities
---
<br>


## Introduction
This project uses the [World University Rankings 2023](https://www.kaggle.com/datasets/alitaqi000/world-university-rankings-2023) dataset from Kaggle (Taqi, 2023). Syed Ali Taqi collected the data by using Python to scrape it from the web, then cleaned and compiled it into a usable form. The dataset has 13 variables and 2341 observations (including rows with NAs). The variables are:
<br><br>

| | Variable | Variable Type | Description |
|---| -------- | ------- | --- |
|1| University rank  | chr  |Rank of specific university all over the world|
|2| University name | chr    |Specific name of University|
|3| Location | chr |Physical place where university exists|
|4| No. of students | chr |Present number of students enrolled in university as of 2023|
|5| No. of students per staff |dbl |Number of students per staff member|
|6| International students |chr  |Percentage of students who are international|
|7| Female : male ratio |chr  |A ratio of female to male students respectively|
|8| Overall score | chr | The combined weighted scores of those given below. Out of 100|
|9| Teaching score | chr |The perceived prestige of the institution based on the Academic Reputation Survey. Out of 100.|
|10| Research score | chr |Reputation for research excellence amongst peers based on the Academic Reputation Survey. Out of 100|
|11| Citations score | chr |The number of citations received by a journal in one year to documents published in the three previous years, divided by the number of documents indexed in Scopus published in those same three years. Out of 100.|
|12| Industry income score | chr |How much money a university receives from the working industry in exchange for its academic expertise. Out of 100|
|13| International outlook score | chr |The ability of a university to attract undergraduates, postgraduates and faculty from all over the globe.|


<br><br>
The dataset will be used to answer the question **Does a university's location and student body makeup influence the number of female students?** We will explore the relationship between international and female students at universities and investigate whether there are any regional variations in this relationship. This research question aims at inference, seeking to identify the demographic factors that impact the gender balance of students.  
<br>
Regarding hypotheses, we will test the following:

**Null Hypothesis ($H_0$):** Across regions, a larger number of international students is not associated with a larger number of female students.

**Alternative Hypothesis ($H_1$):** Across regions, a larger number of international students is associated with a larger number of female students.

<br>
Though the dataset contains many variables, location and international student share were chosen for their theoretical relevance to the research question. Geographical location influences cultural, social, and economic factors that can affect gender dynamics in student populations. Similarly, the presence of international students contributes to diversity on campus, potentially influencing female enrollment rates. Restricting the analysis to these variables also allows an in-depth look at their relationships without the risk of overcomplicating the model or using data with hidden correlations.

<br><br>

---
## Exploratory Data Analysis
Before conducting inference, we will perform Exploratory Data Analysis (EDA) to gain insights about the dataset. This involves reading in the data, cleaning and wrangling it into a tidy format, and generating summary statistics and visualizations. The visualization below **(Figure 1)** shows the total number of female students (pink) and international students (green) by region. Regions with the highest number of international students, such as the European Union and North America, also tend to have the highest number of female students. Conversely, the region with the lowest number of international students, the Caribbean, has the second-lowest number of female students. This suggests a potential relationship between the variables of interest that is worth exploring further.

<br>

**Figure 1: Diverging Bar Chart Showing the Total International and Female students for each Region.**

<div style="text-align:center;">
    <img src="https://raw.githubusercontent.com/Kaylan-W/STAT-301-Project/main/Female_vs_International.png" width="600" height="700" alt="combined_bar" style="padding: 0;">
</div>

In [6]:
# load libraries (dplyr, tidyr and ggplot2 are already attached by tidyverse)
library(tidyverse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(cowplot)

raw_data <- read.csv("https://raw.githubusercontent.com/Kaylan-W/STAT-301-Project/main/world_university_rankings_2023.csv")

In [7]:
# check how many rows have missing data
na_count <- nrow(raw_data) - sum(complete.cases(raw_data))
# na_count

# remove rows with missing data, select columns of interest
filtered <- raw_data %>%
    filter(!is.na(International.Student)) %>%    # remove rows with n/a values for international count
    filter(!(Female.Male.Ratio == "n/a")) %>%    # remove rows with n/a values for the ratio
    filter(!(University.Rank == "Reporter")) %>% # remove rows where the rank is reporter
    select("Name.of.University","Location","No.of.student",
           "International.Student","Female.Male.Ratio")

# rename columns
col_names <- c("name","location","num_students", "pct_international", "f_to_m_ratio")  
colnames(filtered) <- col_names 

# separating the column f_to_m_ratio into two columns 
filtered <- separate(filtered, f_to_m_ratio, into = c("pct_female", "pct_male"), sep = " : ")  

# convert chr columns to numeric
filtered$pct_female <- as.numeric(filtered$pct_female) 
filtered$pct_male <- as.numeric(filtered$pct_male)  

# convert location column to factor
filtered$location <- as.factor(filtered$location)

filtered_loc <- filtered %>%
        filter(!(location == "n/a"))

# approximate the number of international students and students of each gender, to 0 decimal places 
est_int_per_gender <- filtered_loc %>% 
        mutate(approx_int = (num_students * (pct_international/100)) %>% round(0)) %>%
        mutate(approx_int_fem = (num_students * (pct_international/100) * (pct_female/100))%>% round(0))%>%
        mutate(approx_int_male = (num_students * (pct_international/100) * (pct_male/100)) %>% round(0))%>%
        mutate(approx_fem = (num_students * (pct_female/100)) %>% round(0)) %>%
        mutate(approx_male = (num_students * (pct_male/100)) %>% round(0))
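
The approximation step above can be checked by hand on a single row. The sketch below uses the figures for the first university printed by `head(est_int_per_gender)` later in the notebook (40,203 students, 1% international, 68% female). The standalone variable names are only for illustration, and the international-female estimate assumes that international students have the same gender split as the whole student body, which the data cannot confirm.

```r
# Worked example for one university (values taken from the notebook output)
num_students      <- 40203
pct_international <- 1
pct_female        <- 68

# Approximate head counts, rounded to whole students
approx_int <- round(num_students * pct_international / 100)
approx_fem <- round(num_students * pct_female / 100)

# Assumption: international students share the overall female percentage
approx_int_fem <- round(num_students * (pct_international / 100) * (pct_female / 100))

print(c(approx_int = approx_int, approx_fem = approx_fem, approx_int_fem = approx_int_fem))
```

These values (402, 27338 and 273) agree with the corresponding row of the printed output.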

In [8]:
# define regions of the world
n_africa <- c("Algeria", "Egypt", "Morocco", "Tunisia") 
s_africa <- c("Botswana", "Namibia", "South Africa")
e_africa <- c("Ethiopia", "Kenya", "Uganda", "Mozambique", "Zambia", "Tanzania", "Zimbabwe",  "Mauritius")
w_africa <- c("Ghana", "Nigeria")

caribbean <- c("Jamaica", "Cuba", "Puerto Rico")

n_america <- c("Canada", "United States")
c_america<- c("Costa Rica", "Mexico") 
s_america <- c("Argentina", "Brazil", "Chile", "Colombia", "Ecuador", "Peru", "Venezuela")

europe <- c("Croatia", "Montenegro", "Iceland", "Norway", "Serbia", "Switzerland", "Ukraine", 
            "United Kingdom")
eu <- c("Austria", "Belgium", "Bulgaria", "Cyprus", "Czech Republic", "Denmark", "Estonia", "Finland", 
        "France", "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", 
        "Malta", "Netherlands", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden")

middle_east <- c("Iran", "Israel", "Jordan", "Oman", "Qatar", "Saudi Arabia", "United Arab Emirates")

oceania <- c("Australia", "Fiji", "New Zealand")

c_asia    <- c("Kazakhstan") 
s_asia <- c("Sri Lanka", "Bangladesh", "India", "Pakistan", "Nepal")
se_asia  <- c("Brunei", "Brunei Darussalam", "Indonesia", "Malaysia", "Philippines", "Singapore", 
              "Thailand", "Vietnam")
e_asia <- c("China", "South Korea", "Japan", "Hong Kong", "Taiwan")
w_asia  <- c("Georgia", "Azerbaijan", "Turkey", "Northern Cyprus", "Lebanon", "Iraq")


In [9]:
# assign each university to a region
est_int_per_gender$region <- NA

est_int_per_gender <- est_int_per_gender %>%
  mutate(region = case_when(
      location %in% n_africa ~ "Northern Africa" , 
      location %in% s_africa ~ "Southern Africa" ,
      location %in% e_africa ~ "Eastern Africa" ,
      location %in% w_africa ~ "Western Africa" ,
      location %in% caribbean ~ "Caribbean" ,
      location %in% n_america ~ "North America" ,
      location %in% c_america ~ "Central America" ,
      location %in% s_america ~ "South America" ,
      location %in% europe ~ "Europe" ,
      location %in% eu ~ "European Union" ,
      location %in% middle_east ~ "Middle East" ,
      location %in% oceania ~ "Oceania" ,
      location %in% c_asia ~ "Central Asia" ,
      location %in% s_asia ~ "South Asia" ,
      location %in% se_asia ~ "Southeast Asia" ,
      location %in% e_asia ~ "East Asia" ,
      location %in% w_asia ~ "West Asia" ,
      TRUE                  ~ NA_character_ 
  ))

# calculate the number of international, female and all students per region
region_stats <- est_int_per_gender %>%
        group_by(region) %>%
        summarise(sum_international = sum(approx_int), sum_female = sum(approx_fem), total_students = sum(num_students)) 

In [10]:
 # Plot for international students
plot_international <- ggplot(region_stats, aes(x = region, y = sum_international, fill = "International"))+
  geom_bar(stat = "identity") +
  geom_text(aes(label = sum_international), 
            vjust = 0, size = 4, color = "black") + # Add text annotations for y-values
  scale_fill_manual(values = "lightgreen") +  # Set color for international bars
  labs(y = "# International Students") +
  scale_y_continuous(limits = c(0, max(region_stats$sum_female))) +
  theme(legend.position = "none",
        text = element_text(size = 20),
        plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90, hjust = 0, size =12),
        axis.title.x = element_blank(), # Remove x-axis title
        axis.text.x.top = element_text(size = 15),
        axis.text.y = element_blank(),   # Remove y-axis text 
        axis.title.y = element_text(hjust = 0.2),
        axis.ticks.length = unit(0, "cm"),
        panel.background = element_rect(fill = "white")
  ) 

# plot for female students
plot_female <- ggplot(region_stats, aes(x = region, y = sum_female, fill = "Female")) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = sum_female),  
            vjust = 1, size = 4, color = "black") +  # Add text annotations for y-values
  scale_fill_manual(values = "lightpink") +  # Set color for female bars
  labs(y = "# Female Students") +
  theme(legend.position = "none",
        text = element_text(size = 20),
        plot.title = element_text(hjust = 0.5),
        axis.text.x = element_blank(),   # Remove x-axis text  
        axis.title.x = element_blank(),  # Remove x-axis title
        axis.text.y = element_blank(),   # Remove y-axis text 
        axis.title.y = element_text(hjust = 0.8),
        axis.ticks.length = unit(0, "cm"),  # Remove axis ticks
        panel.background = element_rect(fill = "white")
  )  + scale_y_reverse()

# put plots together
options(repr.plot.width = 12, repr.plot.height = 20) 
combo_plot <- plot_grid(plot_international,plot_female, nrow = 2)
# ggsave("Female_vs_International.png", combo_plot, width = 12, height = 15)

<br><br>

---
## Methods and Plan

This section proposes a method to address the research question: **Does a university's location and student body makeup influence the number of female students?** We will explore the relationship between international and female students at universities and investigate regional variations in this relationship.

<br>

### Model:
To address the question of whether a university's location and student body makeup influence the number of female students, multiple linear regression (MLR) can be used as the method of analysis. MLR is appropriate for this study because:
* **Multiple Predictors:** It allows examination of the relationship between multiple independent variables (such as geographical location and student demographics) and a single dependent variable (number of female students).
* **Modeling Continuous Outcome:** The number of female students is a count, but it is large enough to be treated as approximately continuous. Linear regression is a well-established method for modelling the relationship between a set of independent variables and a continuous dependent variable.
* **Accounting for Location:** We can include a categorical variable representing region as an independent variable in the model. This allows us to assess the overall effect of location on the number of female students.
* **Interaction Effect:** To investigate if the relationship between international students and female students varies across regions, we can include an interaction term in the model. A statistically significant interaction term would indicate that the effect of international students on female enrollment depends on the specific region.  
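
The model described in the list above could be sketched as follows. This is a minimal illustration on simulated data, not the notebook's fitted model: the three regions "A", "B" and "C", the sample sizes, and the slopes are all made up, and only the column names `approx_int`, `approx_fem` and `region` mirror the notebook.

```r
set.seed(301)  # arbitrary seed so the toy data are reproducible

# Hypothetical toy data: 3 made-up regions with 40 universities each
toy <- data.frame(
  region     = factor(rep(c("A", "B", "C"), each = 40)),
  approx_int = round(runif(120, min = 100, max = 5000))
)

# Simulated outcome whose slope on approx_int differs by region
slope <- c(A = 1.0, B = 2.0, C = 3.0)[as.character(toy$region)]
toy$approx_fem <- 5000 + slope * toy$approx_int + rnorm(120, sd = 500)

# Additive model vs. model with region-specific slopes (interaction term)
fit_additive    <- lm(approx_fem ~ approx_int + region, data = toy)
fit_interaction <- lm(approx_fem ~ approx_int * region, data = toy)

# Partial F-test: do the interaction terms improve the fit?
interaction_test <- anova(fit_additive, fit_interaction)
print(interaction_test)
print(coef(fit_interaction))
```

In the formula `approx_int * region`, R expands the `*` into both main effects plus their interaction, so each non-reference region gets its own intercept shift and its own slope adjustment.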

### Assumptions of Linear Regression:

One of the main assumptions of MLR is that there is a linear relationship between the independent variables and the dependent variable. In our case, we assume that changes in the geographical location of universities and the composition of their student bodies (including international students and female-to-male ratio) have a linear effect on the number of female students enrolled. While MLR does not require the independent variables to be normally distributed, it does assume that the residuals (the differences between observed and predicted values) are normally distributed. It also assumes that the errors are independent of one another and have constant variance.
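
One way these assumptions might be checked is sketched below on simulated data. The toy variables `x` and `y` are hypothetical and stand in for whichever fitted model is being diagnosed. The calls used (`residuals`, `fitted`, `qqnorm`, `shapiro.test`) are all from base R.

```r
set.seed(301)  # arbitrary seed for reproducibility

# Hypothetical toy data with a truly linear relationship
toy   <- data.frame(x = runif(100, min = 0, max = 10))
toy$y <- 2 + 3 * toy$x + rnorm(100)

fit <- lm(y ~ x, data = toy)
res <- residuals(fit)

# Residuals vs. fitted values: curvature suggests non-linearity,
# a funnel shape suggests non-constant variance
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot and Shapiro-Wilk test for normality of the residuals
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)
print(sw)
```

A small Shapiro-Wilk p-value would count as evidence against normal residuals. With large samples the test flags even minor departures, so the plots are usually read alongside it.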

### Limitations of Linear Regression:

* **Non-linear Relationships:** If the true relationship between the variables is not linear, the model might not capture it accurately.  
* **Outliers:** Outliers can significantly impact the results of linear regression.  
* **Multicollinearity:** If the independent variables are highly correlated, it can lead to unstable coefficient estimates. We can check for multicollinearity using correlation analysis and variance inflation factors (VIF).
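
The multicollinearity and outlier checks mentioned above can be sketched in base R. The helper `vif_manual` is a hypothetical function written for this illustration. It computes $\text{VIF}_j = 1 / (1 - R_j^2)$ by regressing each predictor on the others. The predictors `x1`, `x2`, `x3` are simulated, with `x2` deliberately built to correlate with `x1`. In practice a package function such as `car::vif` is often used instead.

```r
set.seed(301)  # arbitrary seed for reproducibility
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately correlated with x1
x3 <- rnorm(n)                        # generated independently of x1 and x2
toy   <- data.frame(x1, x2, x3)
toy$y <- 1 + x1 + x2 + x3 + rnorm(n)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy, c("x1", "x2", "x3"))
print(round(vifs, 2))

# Cook's distance flags observations with a large influence on the fit
fit   <- lm(y ~ x1 + x2 + x3, data = toy)
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / n)  # 4/n is a common rule of thumb, not a strict cutoff
print(length(influential))
```

Here `x1` and `x2` should show clearly inflated VIFs while `x3` stays close to 1.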

Linear regression with an interaction term offers a suitable approach to investigate the research question. It allows us to model the relationship between international students, location, and female student enrollment while accounting for potential interaction effects. By acknowledging the assumptions and limitations of this method, we can carefully interpret the results and gain valuable insights into the factors influencing gender balance at universities across different regions. 

<br><br>

---
##  Implementation of a proposed model

In this section, we will implement the MLR model with an interaction term and discuss the results.


- [x] Add the new content at the end of your existing notebook containing the data description, question, visualization and plan. We won't regrade those but they provide useful context.
- [x] Name the new section "Implementation of a proposed model"
- [ ] Write computational code to implement *one* of the methods proposed in the previous assignment (or suggested in the interview). 
- [ ] Use *only one* visualization or table to report the results.
- [ ] In 3 or 4 sentences give a brief interpretation of the results. 
- [ ] If needed, comment on any unexpected result or potential problems with the analysis, and possible ways to address issues encountered. 
- [ ] If results are as expected, explain how they address the question of interest.
- [ ] **Do not exceed the 4 sentences limit**.

In [11]:
head(region_stats)  
head(est_int_per_gender)

| region | sum_international | sum_female | total_students |
|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<int>` |
| Caribbean | 1321 | 55894 | 87070 |
| Central America | 15450 | 580624 | 1125053 |
| Central Asia | 4778 | 28114 | 52830 |
| East Asia | 196857 | 1547288 | 3579350 |
| Eastern Africa | 2356 | 108106 | 293743 |
| Europe | 536971 | 1195066 | 2143062 |


| | name | location | num_students | pct_international | pct_female | pct_male | approx_int | approx_int_fem | approx_int_male | approx_fem | approx_male | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<fct>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | University of Abou Bekr Belkaïd Tlemcen | Algeria | 40203 | 1 | 68 | 32 | 402 | 273 | 129 | 27338 | 12865 | Northern Africa |
| 2 | Blida 1 University | Algeria | 30076 | 1 | 68 | 32 | 301 | 205 | 96 | 20452 | 9624 | Northern Africa |
| 3 | Université 8 Mai 1945 Guelma | Algeria | 17530 | 1 | 67 | 33 | 175 | 117 | 58 | 11745 | 5785 | Northern Africa |
| 4 | Oran 1 University | Algeria | 25775 | 1 | 66 | 34 | 258 | 170 | 88 | 17012 | 8764 | Northern Africa |
| 5 | Université Mouloud Mammeri de Tizi-Ouzou | Algeria | 41107 | 0 | 66 | 34 | 0 | 0 | 0 | 27131 | 13976 | Northern Africa |
| 6 | Ferhat Abbas Sétif University 1 | Algeria | 34637 | 1 | 63 | 37 | 346 | 218 | 128 | 21821 | 12816 | Northern Africa |
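
As a rough, purely descriptive check on the six regional rows printed by `head(region_stats)`, a rank correlation between international and female totals can be computed. The data frame `region_head` is a hypothetical name introduced here and holds only those six rows, so this sketch says nothing about the remaining regions. Because both totals grow with the overall size of a region, a positive value is not evidence of a causal link.

```r
# The six regional totals shown in the head(region_stats) output
region_head <- data.frame(
  region = c("Caribbean", "Central America", "Central Asia",
             "East Asia", "Eastern Africa", "Europe"),
  sum_international = c(1321, 15450, 4778, 196857, 2356, 536971),
  sum_female        = c(55894, 580624, 28114, 1547288, 108106, 1195066)
)

# Spearman's rank correlation: do regions that rank high on international
# students also rank high on female students?
rho <- cor(region_head$sum_international, region_head$sum_female,
           method = "spearman")
print(round(rho, 2))
```

For these six rows the rank correlation is about 0.77, which is consistent with the pattern described in the exploratory analysis. The regression with region-specific terms remains the intended way to address the research question.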


<br>

---
##  References 

Syed Ali Taqi. (2023). World University Rankings 2023 [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/6394958