# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Regression Challenge
Week 6 | Days 4-5

## The Times university ranking dataset analysis

A university has chosen your consulting firm to help them understand and improve their standings in the Times rankings of universities.  Your company will be presentings your findings to a review board from the university Friday afternoon at 4pm.  The review board consists of members from across the university: both professors and administrators.  Your presentation must contain both a technical defense of your work and solid recommendations to move forward.

In this challenge, you will draw on the skills you have learned over the past three weeks to create a model of university prestige using the provided predictors. Specifically, your goal is to **predict the total score for each university for the year 2016**. This score directly maps into the university ranking.

You will be drawing on the following skills:
- Basic python and pandas skills
- Data cleaning
- EDA
- Modeling
    - Regression
    - Regularization
    - kNN
    - SGD
- Cross validation

## The Dataset

The data is in a csv file in your repo. It contains the following columns:

- **world_rank** - world rank for the university. Contains rank ranges and equal ranks (e.g. = 94 and 201-250).
- **university_name** - name of university.
- **country** - country of each university.
- **teaching** - university score for teaching (the learning environment).
- **international** - university score international outlook (staff, students, research).
- **research** - university score for research (volume, income and reputation).
- **citations** - university score for citations (research influence).
- **income** - university score for industry income (knowledge transfer).
- **total_score** - total score for university, used to determine rank.
- **num_students** - number of students at the university.
- **student_staff_ratio** - Number of students divided by number of staff.
- **international_students** - Percentage of students who are international.
- **female_male_ratio** - Female student to Male student ratio.
- **year** - year of the ranking (2011 to 2016 included).

### The target for our model is the **total score**, which directly corresponds to the final ranking.

### Note: A total score reported as "-" should be considered a 0, and '-' entries in the submisison will be scored as  0. Consider the implications of this for calculating the loss (MSE).


## Guidelines

The analysis is up to you. **This is fully open-ended.** You are expected to:

- Load the packages you need to do analysis
- Perform EDA on variables of interest
- Form a hypothesis on what is important for the score
- Check your data for problems, clean and munge data into correct formats
- Create or combine new columns/features where beneficial
- Perform statistical analysis with regression and describe the results

---

We will be here in class to help, but if you do not know how to do something, we expect you to **check documentation first**.

**You are not expected to know how to do things by heart. Knowing how to effectively look up the answers on the internet is a critical skill for data scientists!**

## Deliverables

- You will be provided with the data and targets from 2011 through 2015 and the data (no targets) for 2016.
    + Final answers should be submitted by filling in the predicted values for each university in the **submission sheet**. (.to_csv() should be useful for this). Note there are three columns: Rank, Name, and Score. The row order matches the rankings list from the 2016 data.
    + Your submission will be assessed on MSE for the Score column -- so consider your loss functions! <br><br>


- Your team will also submit a clean and annotated version of the **analysis notebook** you used to make your model and produce predictions. <br><br>


- Finally, you will design and deliver an **8 minute presentation** to the university review board. Your presentation must:
    + Defend your model selection
    + Defend your model performance
    + Interpret the model
    + Give clear guidelines to improve rankings


- Remember your audience contains some experts who will want to understand the strength of your predictions, others who will want to know they are getting the best version possible, and still more who only want to know which actions will help their ranking the most.  Do your best to meet all these demands without alienating anyone.

## Teams

You will be working as part of a team on this:

<table>
<tr>
<th>Team 8d599</th>
<th>Team 5d87e</th>
<th>Team 41cfa</th>
<th>Team c13cf</th>
</tr>
<tr>
<td>Ryan</td>
<td>Yuriy</td>
<td>Ridhi</td>
<td>Andrew</td>
</tr>
<tr>
<td>Arafat</td>
<td>Joe</td>
<td>Alex</td>
<td>Will</td>
</tr>
<tr>
<td>Mark</td>
<td>Eric</td>
<td>Donna</td>
<td>Tim</td>
</tr>
<tr>
<td></td>
<td>Scott</td>
<td></td>
<td>Marina</td>
</tr>
</table>


As a team you are responsible for one copy of each deliverable to be submitted as an issue on this repo by **4 PM Friday**.  That is to say, one each of:
- Submission csv
- Analysis notebook
- 8 min Presentation

Every team member must speak in the final presentation, and be ready to answer questions on the process the group used to refine, choose, and interpret the model used to make predictions.

In [298]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

from sklearn.svm import SVC

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn import svm, linear_model
from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import  cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split

In [299]:
data = pd.read_csv('datasets/challenge-dataset.csv', encoding='latin-1')

In [300]:
data.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152,8.9,25%,,2011
1,2,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,96.0,2243,6.9,27%,33 : 67,2011
2,3,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,95.6,11074,9.0,33%,37 : 63,2011
3,4,Stanford University,United States of America,98.3,29.5,98.1,99.2,64.3,94.3,15596,7.8,22%,42:58:00,2011
4,5,Princeton University,United States of America,90.9,70.3,95.4,99.9,-,94.2,7929,8.4,27%,45:55:00,2011


In [301]:
data.describe()

Unnamed: 0,teaching,research,citations,student_staff_ratio,year
count,2603.0,2603.0,2603.0,2544.0,2603.0
mean,37.801498,35.910257,60.921629,18.445283,2014.075682
std,17.604218,21.254805,23.073219,11.458698,1.685733
min,9.9,2.9,1.2,0.6,2011.0
25%,24.7,19.6,45.5,11.975,2013.0
50%,33.9,30.5,62.5,16.1,2014.0
75%,46.4,47.25,79.05,21.5,2016.0
max,99.7,99.4,100.0,162.6,2016.0


In [302]:
'''Null Columns'''
print (data.columns[data.isnull().sum()>0])

Index(['total_score', 'num_students', 'student_staff_ratio',
       'international_students', 'female_male_ratio'],
      dtype='object')


In [303]:
data.isnull().sum()

world_rank                  0
university_name             0
country                     0
teaching                    0
international               0
research                    0
citations                   0
income                      0
total_score               800
num_students               59
student_staff_ratio        59
international_students     67
female_male_ratio         233
year                        0
dtype: int64

In [304]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2603 entries, 0 to 2602
Data columns (total 14 columns):
world_rank                2603 non-null object
university_name           2603 non-null object
country                   2603 non-null object
teaching                  2603 non-null float64
international             2603 non-null object
research                  2603 non-null float64
citations                 2603 non-null float64
income                    2603 non-null object
total_score               1803 non-null object
num_students              2544 non-null object
student_staff_ratio       2544 non-null float64
international_students    2536 non-null object
female_male_ratio         2370 non-null object
year                      2603 non-null int64
dtypes: float64(4), int64(1), object(9)
memory usage: 284.8+ KB


In [305]:
'''Checking for Hyphens in the International Columns'''
np.sum([data['international'] == '-'])

9

In [306]:
'''Replacing the Hyphens with the NAN to impute later'''
data['international'] = data.international.replace('-', np.nan)

In [307]:
data['international'] = data.international.replace('-', np.nan)

from sklearn.preprocessing import Imputer

international_imputer = Imputer(strategy='median')
international_imputer_fit = international_imputer.fit(data['international'].values.reshape(-1, 1))

international_imputer_transform = international_imputer_fit.transform(
    data['international'].values.reshape(-1, 1))

data['international'] = international_imputer_transform

In [308]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2603 entries, 0 to 2602
Data columns (total 14 columns):
world_rank                2603 non-null object
university_name           2603 non-null object
country                   2603 non-null object
teaching                  2603 non-null float64
international             2603 non-null float64
research                  2603 non-null float64
citations                 2603 non-null float64
income                    2603 non-null object
total_score               1803 non-null object
num_students              2544 non-null object
student_staff_ratio       2544 non-null float64
international_students    2536 non-null object
female_male_ratio         2370 non-null object
year                      2603 non-null int64
dtypes: float64(5), int64(1), object(8)
memory usage: 284.8+ KB


In [309]:
'''Count of Hyphens in the Total_Score Column'''
np.sum([data['total_score'] == '-'])

802

In [310]:
'''Replacing the Hyphens with the NAN to impute later'''
# data['total_score'] = data.total_score.replace('-', np.nan)

'Replacing the Hyphens with the NAN to impute later'

In [311]:
# data['total_score'] = data.total_score.replace('-', np.nan)

# from sklearn.preprocessing import Imputer

# total_score_imputer = Imputer(strategy='median')
# total_score_imputer_fit = total_score_imputer_fit.fit(data['total_score'].values.reshape(-1, 1))

# total_score_imputer_transform = total_score_imputer_fit.transform(
#     data['total_score'].values.reshape(-1, 1))

# data['total_score'] = total_score_imputer_transform

In [312]:
'''Income Imputer '''
for k, v in data['income'].items():
    if v == '-':
        data.set_value(k, 'income', np.nan)
    else:
        data.set_value(k, 'income', v)
        
data['income'] = data['income'].astype(float)        
        
income_imputer = Imputer(strategy='median')
income_imputer_fit = income_imputer.fit(data['income'].values.reshape(-1, 1))

income_imputer_transform = income_imputer_fit.transform(
    data['international'].values.reshape(-1, 1))

data['income'] = income_imputer_transform

In [313]:
'''Country'''
for k, v in data['country'].items():
    if v == 'United States of America':
        data.set_value(k, 'country', 1)
    else:
        data.set_value(k, 'country', 0)

data['country'] = data['country'].astype(float)

In [314]:
'''Num Students'''
data['num_students'] = data['num_students'].str.replace(',', '').astype(float)

In [315]:
data['international_students']=[str(pct).split('%')[0] for pct in data['international_students']]
data['international_students']=[float(pct) for pct in data['international_students']]

In [316]:
'''International Students Imputer'''

data['international_students']=[str(pct).split('%')[0] for pct in data['international_students']]
data['international_students']=[float(pct) for pct in data['international_students']]

data['international_students'] = data.international_students.replace('-', np.nan)
from sklearn.preprocessing import Imputer

international_s_imputer = Imputer(strategy='median')
international_s_imputer_fit = international_s_imputer.fit(data['international_students'].values.reshape(-1, 1))

international_s_imputer_transform = international_s_imputer_fit.transform(
    data['international_students'].values.reshape(-1, 1))

data['international_students'] = international_s_imputer_transform

In [317]:
# '''Fixing total Score'''
# def change_hyphen(x): 
#     if x == '-': 
#         return float(0)
#     else: 
#         return float(x)
    
# data['total_score'] = data['total_score'].apply(change_hyphen)

In [320]:
data['total_score'] = data['total_score'].str.replace('-', '0').astype(float)

In [321]:
data.total_score.value_counts()

0.0     802
49.0     13
46.6     11
46.9     10
46.7      9
46.2      9
51.6      8
54.4      7
51.1      7
53.6      7
54.5      7
51.2      7
50.5      7
56.1      7
50.2      7
50.0      7
51.9      7
54.6      7
59.0      7
50.1      7
53.4      7
52.6      6
52.9      6
52.5      6
57.7      6
57.3      6
49.7      6
63.2      6
48.5      6
55.2      6
       ... 
74.9      1
77.5      1
71.0      1
80.5      1
42.4      1
50.7      1
79.1      1
66.1      1
68.9      1
96.1      1
85.2      1
95.5      1
84.5      1
48.4      1
76.9      1
84.9      1
82.8      1
43.2      1
73.3      1
73.4      1
74.2      1
76.4      1
70.1      1
72.9      1
75.1      1
68.1      1
93.1      1
72.1      1
86.9      1
96.0      1
Name: total_score, Length: 386, dtype: int64

In [294]:
# data.total_score.fillna(value=0, inplace=True)

In [295]:
# data.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,1.0,99.7,72.4,98.7,98.8,72.4,96.1,20152.0,8.9,25.0,,2011
1,2,California Institute of Technology,1.0,97.7,54.6,98.0,99.9,54.6,96.0,2243.0,6.9,27.0,33 : 67,2011
2,3,Massachusetts Institute of Technology,1.0,97.8,82.3,91.4,99.9,82.3,95.6,11074.0,9.0,33.0,37 : 63,2011
3,4,Stanford University,1.0,98.3,29.5,98.1,99.2,29.5,94.3,15596.0,7.8,22.0,42:58:00,2011
4,5,Princeton University,1.0,90.9,70.3,95.4,99.9,70.3,94.2,7929.0,8.4,27.0,45:55:00,2011


In [297]:
data.total_scoredata,..value_counts()

-       802
0       800
49       13
46.6     11
46.9     10
46.7      9
46.2      9
51.6      8
54.6      7
50.2      7
51.2      7
54.5      7
59        7
51.9      7
51.1      7
50        7
50.5      7
53.6      7
56.1      7
50.1      7
53.4      7
54.4      7
57.3      6
49.9      6
52.9      6
49.7      6
57.7      6
46        6
45.9      6
50.4      6
       ... 
75.5      1
59.7      1
82        1
84.9      1
75        1
83.2      1
69.2      1
68.9      1
43.2      1
90.9      1
77.3      1
75.7      1
80.5      1
63.1      1
71.7      1
90.4      1
67.2      1
63.6      1
90.5      1
43.1      1
68.1      1
44.9      1
42.5      1
90.2      1
77.1      1
95.6      1
75.1      1
72.9      1
68.4      1
79.5      1
Name: total_score, Length: 387, dtype: int64