# MSBA 605-77 - Python for Analytics
## Program 4
__Name__: ADD HERE (double click the cell in Jupyter to edit)<br>
__Due__: Sunday, October 29 (by 11:59 PM)<br />
__Worth__: 100 pts.<br />
__Purpose__: Use pandas to read and analyze a data file, calculating a correlation between test questions and overall performance.

Add your code to the cells below. When finished, be sure to save your notebook, then _Close and Shutdown Notebook_ from the _File_ menu. Return to Blackboard and upload your completed Notebook file (`Prog4.ipynb`).

### Directions
In the code cell below, write Python code to accomplish the following using a combination of pandas, NumPy, and native Python. Correlation may be applied to exam and quiz question results to determine how effective each question is at differentiating between students that did well and students that did not do well on the assessment. For background, review Blackboard's [Item Analysis article](https://help.blackboard.com/Learn/Instructor/Original/Tests_Pools_Surveys/Item_Analysis). Specifically, review the __Question statistics table on the Item Analysis page__ section for the definition of the _Discrimination_ statistic. It notes:

> _Discrimination_: Indicates how well a question differentiates between students who know the subject matter and those who don't. A question is a good discriminator when students who answer the question correctly also do well on the test. Values can range from -1.0 to +1.0. A question is flagged for review if its discrimination value is less than 0.1 or negative. Discrimination values can't be calculated when the question's difficulty score is 100% or when all students receive the same score on a question.

> Discrimination values are calculated with the Pearson correlation coefficient. `X` represents the scores of each student on a question and `Y` represents the scores of each student on the test.
> 
> <img src="https://help.blackboard.com/sites/default/files/bb_assets_embed/15000/tests_item_analysis_pearson_formula.png">
> 
> These variables are the standard score, sample mean, and sample standard deviation, respectively:
>
> <img src="https://help.blackboard.com/sites/default/files/bb_assets_embed/15000/tests_item_analysis_pearson_formula_definitions.png">

Here, _n_ is the number of students. 
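
If the formula images above do not load, the quoted description matches the standard sample form of the Pearson correlation coefficient. The version below is that textbook form, written out on the assumption that it is what the images show. The subscript $i$ (indexing students) is notation introduced here, not taken from the Blackboard article:

$$
r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{s_X}\right)\left(\frac{Y_i-\bar{Y}}{s_Y}\right),
\qquad
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
s_X = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}
$$

$\bar{Y}$ and $s_Y$ are defined the same way for the test scores. The first factor in the sum is the standardized question score and the second is the standardized test score.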

The provided file `P4-ScoreData.csv` contains results for a quiz for which you are asked to calculate the above _Discrimination_ statistic for each question. Using a combination of pandas, NumPy, and native Python, complete the following steps to calculate the Discrimination statistic step-by-step, and then calculate the same Pearson correlation coefficient again using the pandas `corrwith` method:

1. Use pandas `.read_csv` function to read the `P4-ScoreData.csv` file into a `DataFrame` as described in [Section 6.1 Reading and Writing Data in Text Format](https://wesmckinney.com/book/accessing-data#io_flat_files). Be sure that you save the data file in the same folder as your Notebook. This way, you can simply use the filename without additional path information when reading the file. You should print the `DataFrame` to ensure the file was read in correctly. Once satisfied, you may comment out the print statement.
2. It appears that a few students left some questions unanswered when taking the quiz. These appear as `NaN` values in the `DataFrame`. Use the `.fillna` method of the `DataFrame` to fill those missing values with zeros as described in [Section 7.1's Filling In Missing Data subsection](https://wesmckinney.com/book/data-cleaning#pandas_missing_filling). Again, you should print the `DataFrame` to ensure the missing data were filled in correctly. Once satisfied, you may comment out the print statement.
3. The student names here are simply labels for the row-level data. Use the `.set_index` method of the `DataFrame` to turn the `Name` column into the index. You may include the `inplace=True` argument to modify your `DataFrame` object instead of returning a copy. You may print the `DataFrame` once more to verify the change was made correctly. Once satisfied, you may comment out the print statement.
4. Use the `DataFrame`'s `shape` attribute to unpack the number of students and the number of questions from the associated tuple. Print these values and ensure that you have 15 students and 10 questions.
5. Calculate the mean score for each question using the `mean` method of the `DataFrame` and store in a variable that you can reference later when calculating the Discrimination statistic. Using the default axis of the `DataFrame` should produce a mean for each of the 10 questions. You should print the question means to ensure that they were calculated correctly. Once satisfied, you may comment out the print statement.
6. Similarly, calculate the sample standard deviation for each question using the `std` method of the `DataFrame` and store in a variable that you can reference later when calculating the Discrimination statistic. Using the default axis of the `DataFrame` should produce a standard deviation for each of the 10 questions. You should print these values to ensure that they were calculated correctly. Once satisfied, you may comment out the print statement.
7. Add a new column named `Total` to the `DataFrame` using the `sum` method of the `DataFrame`, specifying `axis=1`. Print the `DataFrame` after this step to ensure that each student's total score for the quiz was calculated correctly.
8. Calculate the overall quiz mean by calling the `mean` method on the `DataFrame`'s `Total` column and store in a variable that you can reference later when calculating the Discrimination statistic. Print the quiz mean score to ensure that it was calculated correctly. Once satisfied, you may comment out the print statement.
9. Similarly, calculate the sample standard deviation of the overall quiz scores by calling the `std` method on the `DataFrame`'s `Total` column and store in a variable that you can reference later when calculating the Discrimination statistic. Print the quiz standard deviation to ensure that it was calculated correctly. Once satisfied, you may comment out the print statement.
10. For the next step, you will need to isolate the 10 question columns. An easy way to do this is to create an index by slicing the `columns` attribute to include only the first 10 columns.
11. To finish the calculation of the Discrimination statistic step-by-step for each question, you will need to calculate the difference between each student's score on that question and the question's mean. This difference will then be divided by that question's standard deviation. This is the term involving the `X`'s in the equation above.
12. In addition, you need to calculate the difference between each student's total score on the quiz and the overall quiz mean. This difference will then be divided by the sample standard deviation of the overall quiz scores. This is the term involving the `Y`'s in the equation above.
13. Next, calculate the product of the two term variables you created in the previous two steps. You'll need to use the transposition of the score differences you constructed in Step 11 above when multiplying.
14. Finally, to complete the calculation of the Discrimination statistic for each question, you need to sum the transposition of the product you calculated in the previous step and divide that sum by `num_students-1`. Print the resulting question Discrimination statistics.
15. A much simpler way to calculate the Discrimination statistic is simply to use pandas `corrwith` method to calculate the Pearson correlation coefficient for all the quiz questions correlated with the `Total` column you calculated in Step 7 above. This should result in the same Discrimination statistic for each of the quiz questions but without needing Steps 8-14 above. Print this result and compare to the values you calculated step-by-step earlier. Before printing, drop the `Total` from the Series produced as your result (as it will obviously correlate perfectly with itself). 
16. Using the information your analysis yielded, answer the following in comments at the bottom of your code.<br />a. Which question appears to need review based on a negative Discrimination statistic (suggesting students who did better on the quiz overall actually did worse on this question)?<br />b. Why might this not be an appropriate interpretation given the actual student scores?

To assist in verifying that your calculations are correct, I've included an Excel file named `P4-Calculations.xlsx` that mirrors these steps using Excel formulas. This should help you understand what needs to be done in the step-by-step calculations and confirm that your results match up.
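
The step-by-step calculation in the directions above can be sketched on a small, made-up table to show why it agrees with `corrwith`. The scores, student labels, and variable names below are hypothetical and are not the assignment data. The sketch also uses a row-wise `mul(..., axis=0)` in place of the transposition described in Steps 13 and 14, which is a different route to the same products.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores (NOT the assignment data): 5 students, 3 questions.
scores = pd.DataFrame(
    {
        "Q1": [9.0, 7.0, 8.0, 10.0, 6.0],
        "Q2": [5.0, 0.0, 10.0, 5.0, 0.0],
        "Q3": [10.0, 10.0, 9.0, 10.0, 10.0],
    },
    index=["A", "B", "C", "D", "E"],
)

n = scores.shape[0]          # number of students
total = scores.sum(axis=1)   # each student's total score

# X term: standardize every question column with its own mean and sample std.
x_term = (scores - scores.mean()) / scores.std()

# Y term: standardize the totals with the overall mean and sample std.
y_term = (total - total.mean()) / total.std()

# Multiply each row of X terms by that student's Y term, sum over students,
# and divide by n - 1.
manual = x_term.mul(y_term, axis=0).sum() / (n - 1)

# The same statistic computed by pandas in one call.
builtin = scores.corrwith(total)

print(manual)
print(builtin)
```

Both printed Series should agree up to floating-point rounding, because `std` uses the sample standard deviation (`ddof=1`) by default, which matches the `n - 1` divisor in the formula.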

In [27]:
import pandas as pd
import numpy as np

filename = 'P4-ScoreData.csv'

# 1: Read the score data file into a DataFrame
frame = pd.read_csv(filename)
print("Regular Dataframe:")
print(frame)
print("")

# 2: 
frame.fillna(0, inplace=True)
print("Dataframe without NaN:")
print (frame)
print("")
#Fills the missing (NaN) scores with 0

# 3: 
frame.set_index('Name', inplace=True)
print("Dataframe Index:")
print (frame)
print("")
#Turns the 'Name' column into the index

# 4: 
stud_number, ques_number = frame.shape
print("# of students:", stud_number)
print("# of questions:", ques_number)
print ("")

# 5: 
mean_ques = frame.mean()
print("Mean:")
print (mean_ques)
print ("")
#Calculates mean score for each question

# 6: 
stddev_ques = frame.std()
print("St Dev:")
print (stddev_ques)
print ("")
#Calculates standard deviation score for each question

# 7: 
frame['Total'] = frame.sum(axis=1)
print("Sum:")
print (frame)
print ("")

# 8:
mean_quiz = frame['Total'].mean()
print("Mean Quiz:")
print (mean_quiz)
print ("")
#Calculates overall quiz mean

# 9:
stddev_quiz = frame['Total'].std()
print("St Dev Quiz:")
print (stddev_quiz)
print ("")
#Calculates overall quiz standard deviation

# 10:
scores_ques = frame.iloc[:, :ques_number]

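# 11: Standardize each question score (X term): (score - question mean) / question std dev
a_diff = scores_ques.sub(mean_ques).div(stddev_ques)

# 12: Standardize each total score (Y term): (total - quiz mean) / quiz std dev
b_diff = (frame['Total'] - mean_quiz) / stddev_quiz

# 13: Multiply each student's X terms by that student's Y term (row-wise)
prod = a_diff.mul(b_diff, axis=0)

# 14: Sum over students and divide by n - 1
value_disc = prod.sum() / (stud_number - 1)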

# 15: 
total_corr = scores_ques.corrwith(frame['Total'])
total_corr.drop('Total', inplace=True, errors='ignore')  # Drop the self-correlation

print("Discrimination values by step:")
print(value_disc)
print ("")

print("Discrimination values by correlation:")
print(total_corr)
print ("")

# Comments:
#a. Question 5 has a negative Discrimination statistic, which suggests that students who did better on the quiz overall did worse on this question.
#b. The statistic reflects only the correlation between one question's scores and the total quiz score. In the rows shown, every student scored 10 on Question 5 except one (Ava, 9), so the negative value rests on a single one-point difference rather than a real pattern. Other factors, such as a student guessing the correct answer, may also affect the result.



Regular Dataframe:
        Name    Q1    Q2    Q3    Q4  Q5   Q6   Q7    Q8    Q9   Q10
0     Sophia   9.0   5.0   8.0   8.5  10  8.0  8.0   9.0  10.0   9.0
1   Muhammad   8.0   NaN   7.0   9.0  10  7.0  7.0   9.5   9.0   7.5
2     Olivia   9.5   5.0   7.5   9.0  10  9.0  8.5   9.5  10.0   9.0
3      Aiden   7.0   0.0   5.0   7.5  10  6.0  6.0   8.0  10.0   7.0
4        Mia   8.0   5.0   6.5   8.5  10  7.5  7.0   9.0   9.5   8.0
5       Liam  10.0   0.0  10.0   9.0  10  9.0  9.0   9.5  10.0   9.0
6   Isabella   9.5   5.0   7.0  10.0  10  8.0  8.5  10.0  10.0   9.5
7     Elijah   7.5  10.0   6.5   8.5  10  7.0  6.5   9.0  10.0   7.0
8        Ava   9.0  10.0   7.0  10.0   9  8.0  8.0  10.0  10.0   8.0
9      Mateo   8.0  10.0   6.5   9.0  10  7.5  7.0   9.5   9.0   8.0
10       Zoe   7.0   5.0   8.0   9.0  10  8.0  8.0   9.5  10.0   7.5
11   Jackson   9.5  10.0   7.5   8.5  10  8.0  8.5   9.0  10.0   9.0
12   Aaliyah  10.0   5.0   9.0  10.0  10  9.0  9.0  10.0  10.0  10.0
13      Noah   

Be sure to save and exit your Jupyter Notebook and shut down JupyterLab (from the __File__ menu) before you submit your notebook on Blackboard for grading.