# Non-Cognitive Factors of Learning

In the scope of academic performance, non-cognitive factors are concepts such as learners' academic behaviors, academic perseverance, academic mindsets, learning strategies, and social skills.
In this notebook we will explore one such factor: confidence. We will use question-level data that includes students' self-reported confidence.
During this hands-on tutorial, we will do the following:

* Create fake data for exploration (because real data comes with privacy concerns)
* Develop features/variables that we will be analyzing
* Visualize the results

In [2]:
#import the necessary libraries (libraries are packages of functions that let us reuse code instead of rewriting and repeating it)
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt

## I. Creating Fake Data

We will create data which represents a student answering a question. The data has the following attributes:

* user_id: unique to the student
* question_id: unique to the question
* prescore: an integer in [0-3] that represents the student's self-reported confidence
  * 0: No idea
  * 1: Unsure
  * 2: Think so
  * 3: I know it
* postscore: 0 or 1, indicating whether the student got the question wrong (0) or right (1)


### 1. Uniform random number
We will start with a simple uniform random number for generating both the prescore and postscore data.

In [4]:
#fake data via random number generation with conditions but no weights
results_i = []
for student in range(100):
  for question in range(100):
    user_id = student
    question_id = question
    prescore = random.randint(0,3)
    postscore = random.randint(0,1)
    student_question = (user_id, question_id, prescore, postscore)
    results_i.append(student_question)

In [5]:
#look inside the list (it is not a dataframe yet)
results_i

In [6]:
#convert the list into a dataframe and assign column names; prescore becomes 'confidence' and postscore becomes 'score' (we won't use this dataframe until the histogram section)
results_i_df = pd.DataFrame(results_i, columns = ('user_id', 'question_id', 'confidence', 'score'))
results_i_df

### 2. Weighted random number generator
We will now create weighted random numbers and re-compute the data. Here we can use the probabilities from the real data.

#### 1. Weighted prescore
For a uniform random number r in [0, 1), the prescore is defined by three cutoffs c1 < c2 < c3:

* 0 <= r < c1
  * prescore = 0
* c1 <= r < c2
  * prescore = 1
* c2 <= r < c3
  * prescore = 2
* c3 <= r < 1
  * prescore = 3
  
With c1=0.05, c2=0.1, and c3=0.3.

This means the data should have the distribution 0 (5%), 1 (5%), 2 (20%), 3 (70%), which mimics the real data.
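
The link between the cutoffs and the resulting shares is plain subtraction: each category's share is the gap between neighboring cutoffs. A minimal sketch of that arithmetic, using the cutoffs above (the names `edges`, `shares`, and `recovered_cutoffs` are illustrative and not part of the notebook):

```python
import numpy as np

# Cutoffs from the text, padded with the endpoints 0 and 1
cutoffs = [0.05, 0.1, 0.3]
edges = np.array([0.0] + cutoffs + [1.0])

# Share of each prescore = width of its interval
shares = np.diff(edges)
for prescore, share in enumerate(shares):
    print(f"prescore {prescore}: {share:.0%}")

# Going the other way: cumulative sums of the shares give back the cutoffs
recovered_cutoffs = np.cumsum(shares)[:-1]
```

The same two operations work for any target distribution: pick the shares, then take cumulative sums to get the cutoffs.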

#### 2. Weighted postscore
In a similar way, we create weighted postscores with the distribution of 0 (40%) and 1 (60%).

In [8]:
#fake data via random number generation with conditions and weighted distribution
def weighted_prescore():
  cutoff1 = 0.05
  cutoff2 = 0.1
  cutoff3 = 0.3

  randy = random.random()
  if randy < cutoff1:
    return 0
  elif randy < cutoff2:
    return 1
  elif randy < cutoff3:
    return 2
  else:
    return 3
  
def weighted_postscore():
  cutoff1 = .4
  
  randy = random.random()
  if randy < cutoff1:
    return 0
  else:
    return 1
  
results = []
for student in range(100):
  for question in range(100):
    user_id = student
    question_id = question
    prescore = weighted_prescore()
    postscore = weighted_postscore()
    student_question = (user_id, question_id, prescore, postscore)
    results.append(student_question)
    
results
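
One way to check that the cutoff logic produces the intended distribution is to draw many values with a fixed seed and count them. The sketch below is a self-contained stand-in rather than the notebook's own code: `draw_prescore` is a hypothetical helper that uses `bisect` to reproduce the same if/elif chain, and the seed and sample size are arbitrary choices.

```python
import bisect
import random
from collections import Counter

CUTOFFS = [0.05, 0.1, 0.3]  # same cutoffs as weighted_prescore above

def draw_prescore(rng):
    # bisect_right counts how many cutoffs are <= r, which is the
    # category the if/elif chain would return for the same r
    return bisect.bisect_right(CUTOFFS, rng.random())

rng = random.Random(0)  # fixed seed so the check is repeatable
n_draws = 10_000
counts = Counter(draw_prescore(rng) for _ in range(n_draws))

observed = {score: counts[score] / n_draws for score in range(4)}
expected = {0: 0.05, 1: 0.05, 2: 0.20, 3: 0.70}
for score in range(4):
    print(score, f"observed={observed[score]:.3f}", f"expected={expected[score]:.2f}")
```

The observed shares will not match the expected ones exactly; they should get closer as `n_draws` grows.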

## Exercise: what weights would you assign to create data that has the following distribution?

* prescore 0 = 5%
* prescore 1 = 10%
* prescore 2 = 30%
* prescore 3 = 55%

In [10]:
#your code goes here

In [11]:
#double check the type of your new object
type(results)

In [12]:
#convert the list into dataframe and assign column names
results_df = pd.DataFrame(results, columns = ('user_id', 'question_id', 'confidence', 'score'))
results_df

![Confidence_profiles](files/images.jpg)


In [15]:
#add another column to the dataframe with conditions for overconfidence and underconfidence (see the ppt slide 11)
results_df['conf_profile'] = np.where(((results_df.confidence == 0) | (results_df.confidence == 1)) & (results_df.score == 1), 'underconf', 
                                   np.where(((results_df.confidence == 2) | (results_df.confidence == 3)) & (results_df.score == 0), 'overconf', 'other'))
print(results_df)
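
To see what the nested `np.where` assigns, it can help to run the same rule over every possible (confidence, score) pair. The sketch below builds that small table; the frame name `combos` is illustrative, and `isin` is used only as a shorter way to write the same conditions.

```python
import itertools
import numpy as np
import pandas as pd

# All 8 combinations of confidence (0-3) and score (0-1)
combos = pd.DataFrame(list(itertools.product(range(4), range(2))),
                      columns=['confidence', 'score'])

low_conf = combos['confidence'].isin([0, 1])
high_conf = combos['confidence'].isin([2, 3])

# Same rule as above: low confidence but correct -> underconf,
# high confidence but wrong -> overconf, everything else -> other
combos['conf_profile'] = np.where(low_conf & (combos['score'] == 1), 'underconf',
                         np.where(high_conf & (combos['score'] == 0), 'overconf', 'other'))
print(combos)
```

Note that 'other' covers two different situations: low confidence with a wrong answer, and high confidence with a right answer.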

In [16]:
#create a column for the count of underconfidence instances
results_df['uconf'] = np.where(results_df['conf_profile'] == 'underconf', 1, 0)
results_df

## Exercise: using the confidence and score columns, create a 0/1 indicator column like overconf or uconf and name it 'knowledgeable'. This column should be 1 only when confidence = 3 and score = 1. Otherwise it should be 0.

In [18]:
#your code goes here

In [19]:
#create a column for the count of overconfidence instances
results_df['overconf'] = np.where(results_df['conf_profile'] == 'overconf', 1, 0)
results_df

## Aggregations in pandas

Key operations to know:

* groupby: groups your data based on the column you select.
* Aggregation: summarizes each group with the operation you select (sum, count, mean, etc.).
* join: combines the resulting columns back into one dataframe.
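
As a small illustration of these three steps, the sketch below aggregates a made-up six-row table from question level to user level. The data and the names `toy`, `per_user`, `per_user_named`, and `n_correct` are invented for the example. Named aggregation, written as `agg(new_name=(column, function))`, is one compact alternative to aggregating each column separately and joining the results.

```python
import pandas as pd

# Made-up question-level data: 2 users, 3 questions each
toy = pd.DataFrame({
    'user_id':    [0, 0, 0, 1, 1, 1],
    'confidence': [3, 2, 3, 1, 0, 2],
    'score':      [1, 0, 1, 1, 0, 0],
})

grouped = toy.groupby('user_id')

# Step by step: one aggregate per column, then join them on user_id
counts = grouped.size().to_frame(name='ques_count')
avg_conf = grouped.agg({'confidence': 'mean'}).rename(columns={'confidence': 'avg_conf'})
per_user = counts.join(avg_conf).reset_index()

# Named aggregation produces several aggregates in a single call
per_user_named = grouped.agg(
    ques_count=('score', 'count'),
    avg_conf=('confidence', 'mean'),
    n_correct=('score', 'sum'),
).reset_index()
print(per_user_named)
```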

In [21]:
#aggregate the data from question level to user level
df = results_df.groupby('user_id')
counts = df.size().to_frame(name='ques_count')
agg_results_final = (counts
 .join(df.agg({'confidence': 'mean'}).rename(columns={'confidence': 'avg_conf'}))
 .join(df.agg({'score': 'sum'}).rename(columns={'score': '#correct_answs'}))
 .join(df.agg({'overconf': 'sum'}).rename(columns={'overconf': '#overconf'}))
 .join(df.agg({'uconf': 'sum'}).rename(columns={'uconf': '#underconf'}))
 .reset_index()
)

agg_results_final

## Plotting: Histograms

In [23]:
#plot distribution of confidence from uniform distribution (see slide 7)
f,ax = plt.subplots()
bins = [-0.5, 0.5,1.5,2.5,3.5,4.5]
plt.hist(results_i_df['confidence'], bins = bins, facecolor='green', alpha=0.5)
plt.xticks(range(0,4))
plt.xlim([-1,4])
plt.xlabel('Confidence Score', fontsize=14)
plt.ylabel('Number of Questions', fontsize=14)
plt.title('Question Confidence\n Uniform Random Numbers', fontsize=16)
plt.show()

In [24]:
#plot distribution of confidence from probability distribution (see slide 7)
f,ax = plt.subplots()
bins = [-0.5, 0.5,1.5,2.5,3.5,4.5]
plt.hist(results_df['confidence'], bins = bins, facecolor='green', alpha=0.5)
plt.xticks(range(0,4))
plt.xlim([-1,4])
plt.xlabel('Confidence Score', fontsize=14)
plt.ylabel('Number of Questions', fontsize=14)
plt.title('Question Confidence\n Weighted Random Numbers', fontsize=16)
plt.show()

### Exercise: plot the distribution of scores for both the uniform random data and the weighted random data

In [26]:
#your code goes here

## Correlations

Correlations show the relationship (negative or positive, strong or weak) between the variables you choose. For example, how correlated is a student's midterm grade with their final grade?
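
As a toy illustration of the midterm/final example, the sketch below uses invented grades for five students (the numbers are made up, not real data) and computes the Pearson correlation with pandas.

```python
import pandas as pd

# Invented grades for five students, for illustration only
grades = pd.DataFrame({
    'midterm': [55, 62, 70, 81, 90],
    'final':   [58, 60, 75, 79, 94],
})

# Pearson correlation ranges from -1 (perfect negative) to +1 (perfect positive)
r = grades['midterm'].corr(grades['final'])
print(round(r, 3))

# A variable is perfectly correlated with any increasing linear function of itself
r_linear = grades['midterm'].corr(2 * grades['midterm'] + 1)
```

A value near +1 here says that students who did well on the midterm also tended to do well on the final; it does not say that one caused the other.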

In [29]:
#calculate correlation between average confidence and accuracy (see slide 14)
agg_results_final['avg_conf'].corr(agg_results_final['#correct_answs'])

In [30]:
#compute the Pearson correlation matrix for the same two columns
data = agg_results_final[['avg_conf','#correct_answs']]
correlation = data.corr(method='pearson')
correlation

### Exercise: calculate Pearson correlation between two other values that you find meaningful.

In [32]:
#your code goes here

## Time series

* Group question level df by question index
* Find counts of confidence profiles for the question level group
* Find total number of questions for question level group
* Find % of questions in each conf profile for question level group
* Plot x -> question index vs y -> (%uc, %oc, %k, %r), i.e. the percent underconfident, overconfident, knowledgeable, and realistic
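
The first four steps can be sketched with `pd.crosstab`, which counts rows per (question, profile) pair and, with `normalize='index'`, converts each question's counts into fractions. The data below is a tiny made-up table that uses only the three labels this notebook assigns to `conf_profile`; the %k and %r series would need their own labels, which are not defined here.

```python
import pandas as pd

# Made-up question-level data: 2 questions, 4 responses each
toy = pd.DataFrame({
    'question_id':  [0, 0, 0, 0, 1, 1, 1, 1],
    'conf_profile': ['overconf', 'other', 'other', 'underconf',
                     'overconf', 'overconf', 'other', 'other'],
})

# Rows: question_id, columns: conf_profile,
# values: percent of that question's responses in each profile
pct = pd.crosstab(toy['question_id'], toy['conf_profile'], normalize='index') * 100
print(pct)

# Each question's percentages add up to 100
row_totals = pct.sum(axis=1)
```

From here, `pct.plot()` would draw one line per profile against the question index.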

## Confidence profiles from uniformly generated random data

In [35]:
#placeholder series: each one is a random permutation of 0-99, not percentages computed from results_i_df
question_index = range(0,100)
per_u_c = random.sample(range(100), 100)
per_o_c = random.sample(range(100), 100)
per_k = random.sample(range(100), 100)
per_r = random.sample(range(100), 100)


f,ax = plt.subplots()
ax.plot(question_index, per_u_c, 'g-', label='underconf')
ax.plot(question_index, per_o_c, 'r-', label='overconf')
ax.plot(question_index, per_k, 'y-', label='knowledgeable')
ax.plot(question_index, per_r, 'b-', label='realistic')
ax.legend(loc='best')
plt.show()

## Confidence profiles from weighted random generation of data

In [37]:
prc_df = results_df.groupby('question_id').agg({'overconf':'sum', 'uconf':'sum'}).reset_index()

# Each question is answered by 100 students, so the sums above are already percentages.
# With a different number of students per question, take the group mean instead:
# prc_df = results_df.groupby('question_id')[['overconf', 'uconf']].mean().mul(100).reset_index()

In [38]:
prc_df

In [39]:
question_index = range(0,100)
per_uc = prc_df['uconf']
per_oc = prc_df['overconf']
#per_k = random.sample(xrange(100), 100)
#per_r = random.sample(xrange(100), 100)

f,ax = plt.subplots()
ax.plot(question_index, per_uc, 'g-', label='Underconf')
ax.plot(question_index, per_oc, 'r-', label='Overconf')
# ax.plot(question_index, per_k, 'y-', label='knowledgable')
# ax.plot(question_index, per_r, 'b-', label='realistic')
ax.legend(loc='best')
plt.xlabel('Questions in Order', fontsize=14)
plt.ylabel('Percent of Students', fontsize=14)
plt.title('Confidence Profiles Over Time', fontsize=16)
plt.show()