## UFCFVQ-15-M Programming for Data Science (Spring 2023)
# Python Programming with Libraries (Task C)

## Student Id: 22063018

### Requirement FR7 - Read CSV data from two files and merge it into a single Data Frame 

In [None]:
import pandas as pd

# Read CSV data from the two files
cd1 = pd.read_csv('task2a.csv')

cd2 = pd.read_csv('task2b.csv')

# Merge the two data frames based on the 'id_student'
merged_cd = pd.merge(cd1, cd2, on='id_student')

# Print the merged data frame
merged_cd


<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>

### Requirement FR8 - Clean the merged data

In [None]:
import pandas as pd

# Remove any rows with missing values
merged_cd = merged_cd.dropna()

#merged_cd.columns

# Remove columns that are not needed
merged_cd = merged_cd.drop(['region', 'final_result', 'highest_education'], axis=1)

# Print the cleaned merged data
merged_cd


<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>

### Requirement FR9 - Filter out unnecessary rows

In [None]:

# Filter out rows where click_events is greater than 20000
merged_cd = merged_cd[merged_cd['click_events'] <= 20000]

# Count the duplicate id_student entries before removing them
print(merged_cd['id_student'].duplicated().sum())

# Remove duplicate rows based on id_student
merged_cd = merged_cd.drop_duplicates(subset='id_student', keep='first')

# Print the filtered data frame
merged_cd


<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>

### Requirement FR10 - Investigate the effects of engagement on attainment

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the merged, cleaned, and filtered data into a data frame
merged_cd = pd.merge(pd.read_csv('task2a.csv'), pd.read_csv('task2b.csv'), on='id_student')
merged_cd = merged_cd.dropna()
merged_cd = merged_cd.drop(['region', 'final_result', 'highest_education'], axis=1)
merged_cd = merged_cd[merged_cd['click_events'] <= 20000]
merged_cd = merged_cd.drop_duplicates(subset='id_student', keep='first')

# Create a scatter plot of click_events vs. score
plt.scatter(merged_cd['click_events'], merged_cd['score'])
plt.xlabel('click_events')
plt.ylabel('score')
plt.title('Effect of engagement (click_events) on attainment (score)')
plt.show()
# From the scatter plot, we can see that there is a slight positive correlation between click_events and score,
# meaning that as the level of engagement (click_events) increases, the level of attainment (score) also tends to increase.
# However, the correlation is not very strong, and there are many data points that do not follow this trend.


<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>

### Requirement FR11 - Test the hypothesis that engagement has some effect on levels of attainment 

In [None]:
import pandas as pd
from scipy.stats import pearsonr

# Load the merged, cleaned, and filtered data into a data frame
merged_cd = pd.merge(pd.read_csv('task2a.csv'), pd.read_csv('task2b.csv'), on='id_student')
merged_cd = merged_cd.dropna()
merged_cd = merged_cd.drop(['region', 'final_result', 'highest_education'], axis=1)
merged_cd = merged_cd[merged_cd['click_events'] <= 20000]
merged_cd = merged_cd.drop_duplicates(subset='id_student', keep='first')

# Perform a Pearson correlation test between click events and score
corr, p_value = pearsonr(merged_cd['click_events'], merged_cd['score'])
print(f'Pearson correlation coefficient: {corr:.3f}')
print(f'P-value: {p_value:.3f}')

# Check whether the p-value is below the 0.05 significance level
if p_value < 0.05:
    print("There is a statistically significant correlation between click_events and score.")
else:
    print("There is insufficient evidence to support a correlation between click events and score.")


<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>

### Requirement FR12 - Investigate the effects of gender on levels of attainment 

In [None]:
import seaborn as sns

# Box plot of score grouped by gender
sns.boxplot(x='gender', y='score', data=merged_cd)

# Explanation of my findings:
# From the box plot, the median score for females is slightly higher than for males,
# and the interquartile range is also slightly larger for females.
# However, the two boxes overlap, so there is no clear-cut difference
# in attainment levels between the two genders.
# Therefore, we cannot conclude from this visualisation alone that gender has a significant effect on levels of attainment.


<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>

### Requirement FR13 - Test if there is any difference between the attainment of male and female students

In [None]:
from scipy.stats import ttest_ind

male_scores = merged_cd[merged_cd['gender'] == 'M']['score']
female_scores = merged_cd[merged_cd['gender'] == 'F']['score']

t_stat, p_val = ttest_ind(male_scores, female_scores)

print("t-statistic:", t_stat)
print("p-value:", p_val)

# Significance level
alpha = 0.05

# Interpret the result: a p-value below alpha indicates a significant difference
if p_val < alpha:
    print("There is a significant difference between the attainment of male and female students.")
else:
    print("There is no significant difference between the attainment of male and female students.")

<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>

# Coding Standards
<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>

# Process Development Report for Programming Task C


## INTRODUCTION
The development process for this task involved several steps: reading in the data from two CSV files, merging them into a single dataframe, cleaning and filtering the data, investigating the effects of engagement on attainment, testing the hypothesis that engagement has some effect on levels of attainment, investigating the effects of gender on levels of attainment, and testing whether there is any difference between the attainment of male and female students.

## PROCESS
Firstly, I read in the data from the two CSV files using the pandas library. I then merged the two dataframes into a single dataframe based on the 'id_student' column.
Next, I removed any rows from the merged data that contained a missing value in any column and dropped the unnecessary columns region, final_result, and highest_education. I also filtered out unnecessary rows by removing all rows where click_events is greater than 20000, along with any duplicate rows based on id_student.
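
The cleaning and filtering steps described above can be sketched on a small example. This is only an illustration: the toy data is invented, the `clean` function and its `max_clicks` parameter are hypothetical names, and only one of the dropped columns (`region`) is included to keep the example short.

```python
import pandas as pd

# Toy data invented for illustration; column names follow the notebook.
toy = pd.DataFrame({
    "id_student": [1, 2, 2, 3, 4],
    "region": ["A", "B", "B", "C", "D"],
    "click_events": [150, 300, 300, 25000, 80],
    "score": [55.0, 70.0, 70.0, 90.0, None],
})


def clean(df, max_clicks=20000):
    """Hypothetical helper: apply the cleaning steps in the order used in the notebook."""
    df = df.dropna()                              # remove rows with any missing value
    df = df.drop(columns=["region"])              # remove a column that is not needed
    df = df[df["click_events"] <= max_clicks]     # remove rows above the click threshold
    return df.drop_duplicates(subset="id_student", keep="first")  # one row per student


cleaned = clean(toy)
print(cleaned)
```

Wrapping the steps in one function would also avoid repeating the same five lines in the FR10 and FR11 cells, although that is a design suggestion rather than something the task requires.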

After cleaning and filtering the data, I investigated the effects of engagement on attainment using a scatter plot. The scatter plot revealed a slight positive correlation between engagement and attainment, which suggests that students with more click events tend to have higher scores.
I then tested the hypothesis that engagement has some effect on levels of attainment by performing a Pearson correlation test between click_events and score. The results showed that there is a statistically significant relationship between engagement and attainment.
Afterwards, I investigated the effects of gender on levels of attainment using a box plot. The box plot showed a slightly higher median score for females, but the two boxes overlapped, so it did not show a clear-cut difference on its own. I then performed an independent t-test to determine whether there is any difference between the attainment of male and female students. The results showed that there is a significant difference in attainment between male and female students.
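
Both tests rest on the same decision rule: the null hypothesis is rejected only when the p-value is below the chosen significance level. A minimal sketch of that rule follows, assuming a significance level of 0.05. The numbers are made up for illustration and `is_significant` is a hypothetical helper, not part of SciPy.

```python
from scipy.stats import pearsonr, ttest_ind

ALPHA = 0.05  # significance level, matching the 0.05 threshold used in the notebook


def is_significant(p_value, alpha=ALPHA):
    """Hypothetical helper: True when the null hypothesis is rejected at level alpha."""
    return p_value < alpha


# Made-up numbers for illustration only (not taken from task2a.csv or task2b.csv).
click_events = [100, 200, 300, 400, 500]
score = [40, 50, 55, 70, 80]
r, p_corr = pearsonr(click_events, score)
print(f"r = {r:.3f}, significant: {is_significant(p_corr)}")

male_scores = [52, 58, 61, 64, 70]
female_scores = [52, 58, 61, 64, 70]  # identical on purpose, so no difference is expected
t_stat, p_t = ttest_ind(male_scores, female_scores)
print(f"t = {t_stat:.3f}, significant: {is_significant(p_t)}")
```

Note that `ttest_ind` assumes equal variances by default; passing `equal_var=False` runs Welch's t-test instead, which may be a safer choice when the two groups differ in size or spread.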

Overall, the development process for this task was straightforward, and the pandas library made it easy to read in and manipulate the data. The use of visualisation tools such as scatter plots and box plots helped to identify patterns and trends in the data, while the t-test made it possible to test whether attainment differs between male and female students.

One challenge that I faced during the development process was identifying and removing duplicate rows based on the 'id_student' column. This was necessary to ensure that the data was accurate and that the results were not skewed by duplicate data. Another challenge was determining which columns were unnecessary and could be removed from the dataframe without losing valuable information.
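
One way to make the duplicate check easier is to inspect the duplicated rows before dropping them. The sketch below uses invented data and shows the difference between counting the extra rows and listing every row that belongs to a repeated `id_student`.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "id_student": [10, 11, 11, 12, 12, 12],
    "score": [60, 72, 72, 48, 51, 48],
})

# duplicated() marks every occurrence after the first, so this counts the extra rows.
n_extra = int(toy["id_student"].duplicated().sum())

# keep=False marks all occurrences, which is useful for inspecting conflicting rows.
all_dupes = toy[toy["id_student"].duplicated(keep=False)]

# Keep only the first row for each student.
deduped = toy.drop_duplicates(subset="id_student", keep="first")

print(n_extra)
print(all_dupes)
print(deduped)
```

In this toy example student 12 has two different scores, so `keep='first'` silently discards one of them. Looking at `all_dupes` first helps decide whether keeping the first row is appropriate.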

## CONCLUSION
In conclusion, this task provided an opportunity to practice data cleaning, filtering, and manipulation, as well as data visualisation and statistical testing. Through this process, I gained valuable insights into the data and was able to draw meaningful conclusions about the effects of engagement and gender on levels of attainment.


<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>