### **Objective:** "Which schools have students that would benefit from outreach services and lead to a more diverse group of students taking the SHSAT and being accepted into New York City's Specialized High Schools."

>**Background :** The Specialized High Schools Admissions Test (SHSAT) is an examination administered to eighth and ninth grade students residing in New York City and used to determine admission to all but one of the city's nine Specialized High Schools. In 2008, about 29,000 students took the test, and 6,108 students were offered admission to one of the high schools based on the results. On average, 30,000 students take this exam annually. The test is given each year in October and November, and students are informed of their results the following March. Those who receive offers decide by the middle of March whether to attend the school the following September. The test is independently produced and graded by American Guidance Service, a subsidiary of Pearson Education, under contract to the New York City Department of Education.

* **If we can rank order schools and be granular about it or show schools on a map and take into account diversity, language, poverty, weather, public transit, whatever, then that's probably a good way to go about it.**

>* **Our efforts should be directed at :**  
>1) Increase in SHSAT registration (Year 2013-2016)  
>2) Increase in SHSAT participation (Year 2013-2016)    
>3) Participation to registration ratio (Year 2016)   
>4) Participation to enrollment ratio (Year 2016)   

>**Datasets used :**     
(1) 2016 School Explorer.csv      
(2) D5 SHSAT Registrations and Testers.csv     
(3) ged-plus-locations.csv     
(4) 2010-2016-school-safety-report.csv     

## Exploratory Data Analysis

In my previous kernel on PASSNYC, I explored the dataset in depth, including geographic analysis, time series trends, distributions of important variables, and comparisons.


In [None]:
from IPython.display import display, HTML
# Render the link to the previous kernel as a clickable anchor
display(HTML('<a href="https://www.kaggle.com/rishih/present-sir">https://www.kaggle.com/rishih/present-sir</a>'))

## [Table of Contents :](#as)    

### [1. Preprocessing](#asd)   
### [2. Hierarchical Clustering](#asd)   
### [3. PFA : Principal Feature Analysis](#asd)   
### [4. Regression : Gradient Boosting & Random Forest](#asd)   
### [5. Obtaining most important features](#asdf)   
### [6. Obtaining feature weights using SHSAT dataset](#asdf)   
### [7. School ranking](#df)  


## 0. Pre-processing

In [None]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np

In [None]:
import pandas as pd
ged=pd.read_csv('../input/ny-ged-plus-locations/ged-plus-locations.csv')
schools=pd.read_csv('../input/data-science-for-good/2016 School Explorer.csv')
secure=pd.read_csv('../input/ny-2010-2016-school-safety-report/2010-2016-school-safety-report.csv')

In [None]:
ged.apply(lambda x:sum(x.isnull()))

Since we won't be using the 'Notes' column, we do not preprocess the NaN values in ged.

In [None]:
ged.head(3)

In [None]:
# Average only the numeric columns per rating
schools.groupby(['Student Achievement Rating']).mean(numeric_only=True)

* As the dataframe above shows, schools rated **Exceeding Target** and **Meeting Target** fare better in **Math and English** proficiency. Also, the majority of students in schools with these ratings are either **White or Asian.**

* In the course of this analysis, I shall proceed with the schools which have ratings: **Approaching Target** or **Not Meeting Target**.

In [None]:
pd.set_option('display.max_columns', None)  
a=schools['Address (Full)'].replace({'NEW YORK':'NewYork','CAMBRIA HEIGHTS':'CambriaHeights','SPRINGFIELD GARDENS':'SpringfieldGardens','REGO PARK':'RegoPark','FOREST HILLS':'ForestHills','ROCKAWAY PARK':'ROCKAWAY','HOWARD BEACH':'HowardBeach','QUEENS VILLAGE':'QueensVillage','COLLEGE POINT':'CollegePoint','RICHMOND HILL':'RichmondHill','FLORAL PARK':'FloralPark','OZONE PARK':'OzonePark','LITTLE NECK':'LittleNeck','LONG ISLAND CITY':'LongIslandCity','MIDDLE VILLAGE':'MiddleVillage','ROOSEVELT ISLAND':'RooseveltIsland','STATEN ISLAND':'StatenIsland','JACKSON HEIGHTS':'JacksonHeights','GREENWICH VILLAGE':'GreenwichVillage','GREAT NECK':'GreatNeck','BROAD CHANNEL':'BroadChannel','BRIGHTON BEACH':'BrightonBeach','MANHATTAN BEACH':'ManhattanBeach','ROCKAWAY BEACH':'RockawayBeach','GRAMERCY PARK':'GramercyPark','PABLEO POINT':'PabloPoint','CARROLL GARDENS':'CarrollGardens','KEW GARDENS':'KewGardens'},regex=True)
division=[None]*len(a)
for i in range(len(a)):
    division[i]=str(a[i]).split(",",2)[0].split()[-1]
schools['neighbourhood']=division
df=schools[(schools['Economic Need Index']>0.6) | (schools['Rigorous Instruction Rating']=='Approaching Target') | (schools['Rigorous Instruction Rating']=='Not Meeting Target') | (schools['Collaborative Teachers Rating']=='Not Meeting Target') | (schools['Collaborative Teachers Rating']=='Approaching Target') | (schools['Supportive Environment Rating']=='Not Meeting Target') | (schools['Supportive Environment Rating']=='Approaching Target') | (schools['Effective School Leadership Rating']=='Not Meeting Target')| (schools['Effective School Leadership Rating']=='Approaching Target') | (schools['Strong Family-Community Ties Rating']=='Not Meeting Target') | (schools['Strong Family-Community Ties Rating']=='Approaching Target') | (schools['Trust Rating']=='Not Meeting Target') | (schools['Trust Rating']=='Approaching Target') | (schools['Student Achievement Rating']=='Not Meeting Target') | (schools['Student Achievement Rating']=='Approaching Target')]
df.head(3)

**Since only grade 8 and grade 9 students can take the SHSAT, we remove all schools that have neither of these grades (upper grade<8 or lower grade>9).**


In [None]:
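# Compare grades numerically (PK -> -1, 0K -> 0) and keep schools whose grade span overlaps grades 8-9
g=df[['Grade Low','Grade High']].replace({'PK':-1,'0K':0}).apply(pd.to_numeric, errors='coerce')
df=df[(g['Grade Low']<=9) & (g['Grade High']>=8)]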

In [None]:
# Since we do not need the performance for other grades, we drop those columns
df=df.drop(df.columns[41:141], axis=1) 
df=df.drop(df.columns[[0,1,2,4]],axis=1)

In [None]:
df.shape

### > Looking at null values

In [None]:
df.apply(lambda x:sum(x.isnull()))

In [None]:
# Strip the dollar sign and thousands separators before converting to float
df['School Income Estimate']=df['School Income Estimate'].replace({r'\$':'', ',':''},regex=True).astype(float)

* **Replacing School Income Estimate nan values of Community Schools with the Community School average.**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize=(20,10))
plt.subplot(211)
sns.set(rc={'figure.facecolor':'lightgray'})
sns.boxplot(y=df['School Income Estimate'],x=df["Community School?"])
plt.title('School Income Estimate vs Community School?')
plt.subplot(212)
sns.set(rc={'figure.facecolor':'lightgray'})
sns.boxplot(y=df['School Income Estimate'],x=df["District"])
plt.title('School Income Estimate vs District')
plt.show()

* **Replacing the NaN values of Community Schools with the average of all Community Schools.**

In [None]:
# Mean income of Community Schools, used to fill only their missing values
is_community=df['Community School?']=='Yes'
a=df.loc[is_community,'School Income Estimate'].mean()
df.loc[is_community & df['School Income Estimate'].isnull(),'School Income Estimate']=a

* **Replacing the remaining NaN values with district average of certain districts (which have a low variance)**

In [None]:
import numpy as np
# Districts whose income estimates show a low variance in the boxplot above
low_var_districts=[4,5,6,7,9,12,16,17,18,19,23,26,32]
a=df.groupby('District')['School Income Estimate'].agg(np.mean)
null_index=df[df['School Income Estimate'].isnull() & df['District'].isin(low_var_districts)].index.tolist()

for i in null_index:
    df.at[i,'School Income Estimate']= a[df['District'][i]]

* Now what do we do with the remaining NaN values?
* We drop the rows that still contain NaN values, as all districts have Exceeding Target or Meeting Target as their most common rating.

In [None]:
schools[['Supportive Environment Rating','Rigorous Instruction Rating','Collaborative Teachers Rating','Student Achievement Rating','Trust Rating','Effective School Leadership Rating','Strong Family-Community Ties Rating','District']].groupby(['District'],as_index=False)[['Supportive Environment Rating','Rigorous Instruction Rating','Collaborative Teachers Rating','Student Achievement Rating','Trust Rating','Effective School Leadership Rating','Strong Family-Community Ties Rating']].agg(lambda x: x.value_counts().index[0])


In [None]:
# remove the rows with remaining NaN values
df=df.dropna()

In [None]:
df.shape

The term **'community school'** refers to a type of publicly funded school in the United States that serves as **both an educational institution and a centre of community life.** A community school is both a place and a set of partnerships between the school and other community resources. Its integrated focus on academics, youth development, family support, health and social services and community development leads to improved student learning, stronger families and healthier communities. Using public schools as hubs, community schools bring together many partners to offer a range of support and opportunities to children, youth, families and communities—before, during and after school, and on weekends.   


<font color='darkgray'>@ Wikipedia</font>

### > How many of them are not community schools?

In [None]:
len(df[df['Community School?']=='No'])

### > Where are these schools located?

In [None]:
from collections import Counter
Counter(df['neighbourhood']).most_common(10)

* **Most schools are located in Brooklyn and the Bronx, where the percentages of Black and Hispanic students are high.**

In [None]:
# Strip the percent sign from every percentage column and convert to float
pct_cols=['Percent of Students Chronically Absent','Rigorous Instruction %','Collaborative Teachers %','Supportive Environment %',
          'Effective School Leadership %','Strong Family-Community Ties %','Trust %','Student Attendance Rate']
for col in pct_cols:
    df[col]=df[col].replace({'%':''},regex=True).astype(float)

### > Taking a look at the neighbourhood in our dataframe

In [None]:
import numpy as np
df['Percent Black']=df['Percent Black'].replace({'%':''},regex=True).astype(float)
df['Percent Asian']=df['Percent Asian'].replace({'%':''},regex=True).astype(float)
df['Percent Hispanic']=df['Percent Hispanic'].replace({'%':''},regex=True).astype(float)
df['Percent White']=df['Percent White'].replace({'%':''},regex=True).astype(float)
df.groupby(['neighbourhood'],as_index=False)[['Percent Black','Percent Hispanic','Percent White','Percent Asian']].agg(np.mean)

### > Null values of secure dataset

In [None]:
secure.apply(lambda x:sum(x.isnull()))

### > Finding the closest GED centres to these schools

The **GED** (General Educational Development, also referred to as a General Education Diploma) is a set of tests that, when passed, certify that the test taker (American or Canadian) has high-school level academic skills. It is designed for those who never finished high school and do not hold a High School Diploma.

> * The GED Tests include five subject area tests: **Language Arts/Writing, Language Arts/Reading, Social Studies, Science, and Mathematics.**
>
>* In addition to **English**, the GED tests are available in **Spanish, French, large print, audiocassette and Braille.**
>
>* The GED credential itself is issued by the state, province or territory in which the test taker lives.
>
>* Many government institutions and universities regard the GED as the same as a high school diploma with respect to program eligibility and as a prerequisite for admissions.

In [None]:
from math import cos, asin, sqrt

def distance(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in km; p converts degrees to radians, 12742 km is the Earth's diameter
    p = 0.017453292519943295
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
    return 12742 * asin(sqrt(a))

def closest(data, v):
    # Linear scan over all sites; returns (site name, postcode, distance in km) of the nearest one
    minimum=100000
    ind=0
    for index,row in data.iterrows():
        dist=distance(v['Latitude'],v['Longitude'],row['Latitude'],row['Longitude'])
        if minimum > dist:
            minimum=dist
            ind=index
    return data['Program Site name'][ind], data['Postcode'][ind], minimum

#closest(ged, df[['Latitude','Longitude']].loc[0])


In [None]:
a=[]
b=[]
e=[]
for index,row in df.iterrows():
    c,f,d=closest(ged, df[['Latitude','Longitude']].loc[index])
    a.append(c)
    e.append(f)
    b.append(d)
df['Closest GED Center']=a
df['Distance']=b
df['Postcode']=e
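
The nested `iterrows()` scan above can get slow on larger tables. As a rough sketch only (not the method used in this kernel), the same haversine formula can be evaluated for all school/site pairs at once with NumPy broadcasting. The coordinates below are made up for illustration, and `haversine_matrix` is a hypothetical helper name.

```python
import numpy as np

EARTH_DIAMETER_KM = 12742  # same constant as in distance() above

def haversine_matrix(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between every (lat1, lon1) and every (lat2, lon2)."""
    lat1, lon1 = np.radians(lat1)[:, None], np.radians(lon1)[:, None]
    lat2, lon2 = np.radians(lat2)[None, :], np.radians(lon2)[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return EARTH_DIAMETER_KM * np.arcsin(np.sqrt(a))

# Hypothetical coordinates: two schools and three candidate sites
school_lat = np.array([40.7128, 40.8448])
school_lon = np.array([-74.0060, -73.8648])
site_lat = np.array([40.7306, 40.8500, 40.6500])
site_lon = np.array([-73.9866, -73.8700, -73.9500])

dist = haversine_matrix(school_lat, school_lon, site_lat, site_lon)  # shape (schools, sites)
nearest = dist.argmin(axis=1)    # position of the closest site for each school
nearest_km = dist.min(axis=1)    # distance to that site in km
print(nearest, nearest_km.round(2))
```

With real data, `nearest` would be used to look up the site name and postcode by position, for example with `ged.iloc[nearest]`, after dropping sites without coordinates.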

### > Crime Index

* Since I am going to use only the crime columns **(Major N, Oth N, Prop N, NoCrim N, Vio N)**, I'll fill the NaN values in these columns with the mean value of the respective district.

In [None]:
secure=secure.dropna(subset=['Geographical District Code'])

In [None]:
plt.figure(figsize=(20,6))
sns.set(rc={'figure.facecolor':'lightgray'})
sns.boxplot(y=secure['NoCrim N'],x=secure['Geographical District Code'])
plt.title('District vs NoCrim N', size=15)
plt.show()

In [None]:
grouped=secure.groupby('Geographical District Code')
z=grouped[['Major N','Vio N','Prop N','NoCrim N','Oth N']].agg(np.mean)

In [None]:
crime_cols=['Major N','Vio N','Prop N','NoCrim N','Oth N']
null_index=secure[secure['Major N'].isnull()].index.tolist()
for i in null_index:
    # overwrite all five crime counts of the row with the means of its district
    secure.loc[i,crime_cols]=z.loc[secure['Geographical District Code'][i],crime_cols].values
    

* **We create a common index, the crime index, as a weighted sum of all incident types (major crimes, violent crimes, property crimes, non-criminal incidents and other crimes).**

In [None]:
secure['crime index']=secure['Major N']*0.55 + secure['Vio N']*0.25 + secure['Prop N']*0.1 + secure['NoCrim N']*0.05 + secure['Oth N']*0.05

In [None]:
# obtaining the mean crime index of each postcode and mapping it onto the schools
grouped=secure.groupby('Postcode')['crime index'].mean()
df['crime index']=df['Postcode'].map(grouped)

### > Binarizing the 'Community School?' column

In [None]:
# 'Yes' -> 1, 'No' -> 0
df['Community School?']=(df['Community School?']=='Yes').astype(int)

In [None]:
df3=df[['Economic Need Index', 'School Income Estimate','Community School?','Percent Asian', 'Percent Black', 'Percent Hispanic','Percent White', 'Student Attendance Rate',
       'Percent of Students Chronically Absent', 'Rigorous Instruction %','Collaborative Teachers %','Supportive Environment %', 'Effective School Leadership %',
            'Strong Family-Community Ties %','Trust %','Average ELA Proficiency',
       'Average Math Proficiency', 'Distance', 'crime index']]

In [None]:
df3.apply(lambda x:sum(x.isnull()))

In [None]:
df3.shape

# ---------------------------------------------------------------------------------------------------------------

## 1. Perform the Hierarchical Clustering
Now that the feature matrix `df3` is ready, let's do the actual clustering on it:

In [None]:
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
%matplotlib inline
np.set_printoptions(precision=5, suppress=True) 

### > Cophenetic Correlation Coefficient
The cophenet() function compares (correlates) the actual pairwise distances of all our samples to those implied by the hierarchical clustering. The closer the value is to 1, the better the clustering preserves the original distances.

* I use the **euclidean distance metric** as this produces the **highest Cophenetic Correlation Coefficient.**

In [None]:
# generate the linkage matrix
Z = linkage(df3,'complete')
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

c, coph_dists = cophenet(Z, pdist(df3))
c
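
To illustrate how the cophenetic correlation coefficient can guide the choice of linkage settings, here is a small sketch on synthetic data (two made-up Gaussian blobs standing in for `df3`). The names `X_demo`, `Z_demo` and `scores` are assumptions for this example, and the scores it prints say nothing about the PASSNYC data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Synthetic stand-in for df3: two well-separated blobs in 4 dimensions
X_demo = np.vstack([rng.normal(0, 1, size=(30, 4)),
                    rng.normal(8, 1, size=(30, 4))])

pairwise = pdist(X_demo, metric='euclidean')  # the "true" pairwise distances

scores = {}
for method in ['single', 'complete', 'average', 'ward']:
    Z_demo = linkage(X_demo, method=method, metric='euclidean')
    # cophenet returns (coefficient, cophenetic distances); keep the coefficient
    scores[method], _ = cophenet(Z_demo, pairwise)

best = max(scores, key=scores.get)
print(scores)
print('best method on this toy data:', best)
```

The same loop could be run over `df3` to check which method/metric combination preserves the original distances best.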

In [None]:
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()


Let's take apart the dendrogram:

* On the x axis you see labels. If you don't specify anything else, they are the indices of your samples in df3.
* On the y axis you see the distances (of the 'complete' linkage method in our case).

Starting from each label at the bottom, you can see a vertical line up to a horizontal line. The height of that horizontal line tells you about the distance at which this label was merged into another label or cluster.

Let's have a look at the distances of the last 10 merges:

In [None]:
Z[-10:,2]

We can also see that above a distance of about 21080 there is a huge jump to the final merge at a distance of approx. 61352. Such distance jumps (gaps) in the dendrogram are pretty interesting for us. They indicate that clusters are being merged that maybe shouldn't be merged. In other words, the things merged here probably don't belong to the same cluster, which suggests that there are just 4 clusters here.

### > Eye Candy
Even though this already makes for quite a nice visualization, we can polish it even more by annotating the distances inside the dendrogram, using some of the useful return values of dendrogram():

In [None]:
plt.figure(figsize=(25, 10))
def fancy_dendrogram(*args, **kwargs):
    max_d = kwargs.pop('max_d', None)
    if max_d and 'color_threshold' not in kwargs:
        kwargs['color_threshold'] = max_d
    annotate_above = kwargs.pop('annotate_above', 0)

    ddata = dendrogram(*args, **kwargs)
    
    if not kwargs.get('no_plot', False):
        plt.title('Hierarchical Clustering Dendrogram (truncated)')
        plt.xlabel('sample index or (cluster size)')
        plt.ylabel('distance')
        for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
            x = 0.5 * sum(i[1:3])
            y = d[1]
            if y > annotate_above:
                plt.plot(x, y, 'o', c=c)
                plt.annotate("%.3g" % y, (x, y), xytext=(0, -5),
                             textcoords='offset points',
                             va='top', ha='center')
        if max_d:
            plt.axhline(y=max_d, c='k')
    return ddata

fancy_dendrogram(
    Z,
    truncate_mode='lastp',
    p=100,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,
    annotate_above=10,  # useful in small plots so annotations don't overlap
)
plt.show()


The above shows a truncated dendrogram, which only shows the last p=100 merged clusters.

### > Selecting a Distance Cut-Off aka Determining the Number of Clusters
As explained above, a huge jump in distance is typically what we're interested in if we want to argue for a certain number of clusters. If we have the chance to do this manually, I would always opt for that, as it allows us to gain some insights into the data and to perform some sanity checks on the edge cases. In our case, I would put the cut-off just above 21080, where the jump is pretty obvious.

In [None]:
# set the cut-off just above the merge distance of 21080
max_d = 23000   # max_d as in max_distance

Let's visualize this in the dendrogram as a cut-off line:

In [None]:
plt.figure(figsize=(25, 10))
fancy_dendrogram(
    Z,
    truncate_mode='lastp',
    p=100,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,
    annotate_above=100,
    max_d=max_d,  # plot a horizontal cut-off line
)
plt.show()

### > Automated Cut-Off Selection

Now while this manual selection of a cut-off value offers a lot of benefits when it comes to checking for a meaningful clustering and cut-off, there are cases in which we can automate this.

### > Elbow Method
It tries to find the clustering step where the acceleration of distance growth is the biggest (the "strongest elbow" of the first line in the plot below, which corresponds to the highest value of the second line):

In [None]:
last = Z[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.figure(figsize=(20,8))
plt.plot(idxs, last_rev)

acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
k = acceleration_rev.argmax() + 2  # if idx 0 is the max of this we want 2 clusters
print("clusters:", k)
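
To make the index arithmetic of the elbow rule concrete, here is a tiny worked sketch with made-up merge distances (not taken from the school data). The names ending in `_demo` are hypothetical.

```python
import numpy as np

# Hypothetical distances of the last 6 merges, in increasing order.
# The big jump is at the second-to-last merge (3.4 -> 12.0).
last_demo = np.array([2.0, 2.5, 3.0, 3.4, 12.0, 14.0])

# Second difference of the merge distances ("acceleration")
acceleration_demo = np.diff(last_demo, 2)

# Reverse so that index 0 belongs to the final merge (2 clusters -> 1),
# index 1 to the merge before it (3 clusters -> 2), and so on.
acceleration_demo_rev = acceleration_demo[::-1]

k_demo = acceleration_demo_rev.argmax() + 2
print('acceleration:', acceleration_demo)
print('clusters:', k_demo)
```

Here the strongest acceleration sits one step before the final merge, so the rule suggests stopping at 3 clusters. The rule only looks at second differences, so it can never suggest a single cluster, and it is worth checking its answer against the dendrogram.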

### > Retrieve the Clusters
**Now, let's finally have a look at how to retrieve the clusters, for different ways of determining k. We can use the fcluster function.**

**1. Knowing max_d:
Let's say we determined the max distance with help of a dendrogram, then we can do the following to get the cluster id for each of our samples:**

In [None]:
from scipy.cluster.hierarchy import fcluster
max_d = 23000
clusters = fcluster(Z, max_d, criterion='distance')
clusters

In [None]:
Counter(clusters)

**2. Knowing k:
Another way, starting from the dendrogram, is to say "I can see I have k=2 clusters". We can then use:**

In [None]:
k=2
cluster=fcluster(Z, k, criterion='maxclust')

In [None]:
fcluster(Z, k, criterion='maxclust')
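
The two criteria can be compared on a toy example. This is only a sketch with made-up one-dimensional points forming three obvious groups; `points`, `Z_demo`, `by_distance` and `by_count` are names assumed for the illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 1-D data with three obvious groups
points = np.array([[0.0], [0.2], [0.4], [5.0], [5.1], [10.0], [10.3]])
Z_demo = linkage(points, 'complete')

# Cut the tree at height 2.0: every merge above that distance is undone
by_distance = fcluster(Z_demo, t=2.0, criterion='distance')

# Ask directly for at most 3 flat clusters
by_count = fcluster(Z_demo, t=3, criterion='maxclust')

print(by_distance)
print(by_count)
```

On this data both calls return the same partition, because the cut at 2.0 falls into the gap between the within-group merges (at most 0.4) and the first between-group merge (5.1).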

**I am going to use the hierarchical clustering.**

In [None]:
z=ged[['Program Site name','Latitude','Longitude','Borough']].dropna()

import branca.colormap as cm
import folium
from folium import plugins

step = cm.StepColormap(
    ['aqua','yellow','red'],
    vmin=0.5, vmax=3.5,
    index=[0.5,1.5,2.5],
    caption='step'
)
    
step

* **Aqua: Cluster 1**
* **Yellow: Cluster 2**
* **Red: Cluster 3**

In [None]:
df['Cluster']=fcluster(Z, max_d, criterion='distance')
df = df.reset_index(drop=True)
df3 = df3.reset_index(drop=True)

In [None]:
m = folium.Map([df['Latitude'][0], df['Longitude'][0]], zoom_start=9.5,tiles='cartodbdark_matter')

i=0
for lat, lon in zip(df['Latitude'], df['Longitude']):
    folium.CircleMarker([lat, lon], color=step(df['Cluster'][i]), fill=True, radius=0.9).add_to(m)    
    i+=1
m

In [None]:
grouped=df.groupby(['Cluster']).agg(np.mean)
grouped.drop(grouped.columns[1:6], axis=1)
#grouped

* **Let's neglect cluster 4. On careful observation, we can see that cluster 2 has the highest percentages of Black and Hispanic students (40.8% and 50%). Cluster 1 has the highest percentage of Asian students (20.5%) and cluster 3 has the highest percentage of White students (29.25%).**

* **This is consistent with the high Economic Need Index of cluster 2 and the low Economic Need Index of cluster 3. Moreover, the crime index is highest in the areas of cluster 2 (1.06) and lowest in cluster 1 (0.88).**

* **The distance to the closest GED centre is largest for cluster 3 (4.19).**

* **Student Attendance Rate is lowest for cluster 2 (91.34%), which also has the highest percentage of students chronically absent (25.77%).**

* **Average ELA and Math Proficiency are lowest for cluster 2.**

* **Fewer Grade 8 students score a 4 in cluster 2.**

> This gives us a rough idea of the important features coming into play.



### > Let's look at the schools in each cluster

In [None]:
grouped=df[['School Name','Cluster']].groupby(['School Name'],as_index=False).agg(lambda x:x.value_counts().index[0]).groupby(['Cluster'])
aug=grouped['School Name'].apply(lambda x: sorted(set(x)))[1]
nov=grouped['School Name'].apply(lambda x: sorted(set(x)))[2]
apr=grouped['School Name'].apply(lambda x: sorted(set(x)))[3]

z={'School Name':[aug,nov,apr], 'Cluster': [1,2,3]}
z=pd.DataFrame(z)
z.style.set_properties(**{'background-color': 'black',
                           'color': 'gold','border-color': 'white'})

# ---------------------------------------------------------------------------------------------------------------

## 2. PFA (Principal Feature Analysis) 

>PCA has the disadvantage that measurements from all the original features are used in the projection to the lower dimensional space. PFA, on the other hand performs dimensionality reduction of a feature set by choosing a subset of the original features that contains most of the essential information, using the same criteria as PCA.

### > Standardizing

In [None]:
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(df3)

### > Obtaining the most important features 

In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from collections import defaultdict
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q
        self.n_features = n_features

    def fit(self, X):
        # default: keep as many principal components as there are features
        if not self.q:
            self.q = X.shape[1]

        sc = StandardScaler()
        X = sc.fit_transform(X)

        # each row of A_q describes one original feature in principal-component space
        pca = PCA(n_components=self.q).fit(X)
        A_q = pca.components_.T

        # group features that behave alike across the components
        kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_

        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
            dists[c].append((i, dist))

        # keep the feature closest to each cluster centre
        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]

In [None]:
import numpy as np

pfa = PFA(n_features=10)
pfa.fit(X_std)

# To get the transformed matrix
X_transformed = pfa.features_

# To get the column indices of the kept features
column_indices = pfa.indices_
column_indices

### > Most important features

In [None]:
df3.columns[column_indices].values

# ---------------------------------------------------------------------------------------------------------------

## 3. Merging with the SHSAT data of District 5

In [None]:
shst=pd.read_csv('../input/data-science-for-good/D5 SHSAT Registrations and Testers.csv')
shst=shst.rename(columns={'DBN':'Location Code'})
shst=shst.drop(shst.columns[1],axis=1)

In [None]:
shst.head(5)

In [None]:
# participation to enrollment ratio:
# students who actually took the test / students enrolled as of 10/31
shst['PEratio']=shst['Number of students who took the SHSAT']/shst['Enrollment on 10/31']
# registration to enrollment ratio
shst['REratio']=shst['Number of students who registered for the SHSAT']/shst['Enrollment on 10/31']
# participation to registration ratio
shst['PRratio']=shst['Number of students who took the SHSAT']/shst['Number of students who registered for the SHSAT']
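
The three ratios are easiest to read on a toy table. The sketch below uses made-up numbers and shortened, hypothetical column names (`enrolled`, `registered`, `took`). It also shows one way missing values can arise in `PRratio`: a row with zero registrations divides 0 by 0.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real columns are 'Enrollment on 10/31', etc.
demo = pd.DataFrame({'enrolled':   [100, 80, 60],
                     'registered': [20, 0, 10],
                     'took':       [15, 0, 10]})

demo['PEratio'] = demo['took'] / demo['enrolled']        # participation / enrollment
demo['REratio'] = demo['registered'] / demo['enrolled']  # registration / enrollment
demo['PRratio'] = demo['took'] / demo['registered']      # participation / registration, 0/0 gives NaN

print(demo)
print(demo[['PEratio', 'REratio', 'PRratio']].isnull().sum())
```

A row where nobody registered has no defined participation-to-registration ratio, so it has to be imputed or dropped before clustering.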

In [None]:
ged_secure=pd.merge(df,shst,on='Location Code')
ged_secure.head(3)

In [None]:
ged_secure.shape

In [None]:
ged_secure.columns

### > NaN values in the ratio

In [None]:
ged_secure[['PEratio','REratio','PRratio']].apply(lambda x:sum(x.isnull()))  

**Replacing the NaN values in PRratio with the mean PRratio of the same Zip code**

In [None]:
sns.set(rc={'figure.facecolor':'lightgray'})
sns.boxplot(y=ged_secure['PRratio'],x=ged_secure['Zip'])
plt.title('PRratio vs Zip')
plt.show()

In [None]:
import numpy as np
grouped=ged_secure.groupby('Zip')
a=grouped['PRratio'].agg(np.mean)
null_index=ged_secure[ged_secure['PRratio'].isnull()].index.tolist()

for i in null_index:
    ged_secure.at[i,'PRratio']= a[ged_secure['Zip'][i]]     
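
The same group-mean imputation can also be written without an explicit loop. This is a sketch on made-up data, not the code used in this kernel: `groupby(...).transform('mean')` returns a Series aligned with the original rows, which `fillna` can then use directly.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with two missing ratios
demo = pd.DataFrame({'Zip':     [10027, 10027, 10027, 10030, 10030],
                     'PRratio': [0.5, np.nan, 0.7, 0.4, np.nan]})

# Per-Zip mean (NaN values are skipped), repeated on every row of that Zip
zip_mean = demo.groupby('Zip')['PRratio'].transform('mean')

# Fill only the missing entries; existing values stay untouched
demo['PRratio'] = demo['PRratio'].fillna(zip_mean)
print(demo)
```

If a Zip code has no observed ratio at all, its group mean is NaN as well and the gap remains, so a final null check is still useful.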

## 3.1 Hierarchical Clustering

In [None]:
df3=ged_secure[['Economic Need Index', 'School Income Estimate','Community School?','Percent Asian', 'Percent Black', 'Percent Hispanic','Percent White', 'Student Attendance Rate',
       'Percent of Students Chronically Absent', 'Rigorous Instruction %','Collaborative Teachers %','Supportive Environment %', 'Effective School Leadership %',
            'Strong Family-Community Ties %', 'Trust %','Average ELA Proficiency',
       'Average Math Proficiency', 'Distance', 'crime index','Enrollment on 10/31',
       'Number of students who registered for the SHSAT',
       'Number of students who took the SHSAT']]

In [None]:
# generate the linkage matrix
Z = linkage(df3)
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

c, coph_dists = cophenet(Z, pdist(df3))
c

In [None]:
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()

In [None]:
Z[-10:,2]

* **Huge jump after distance=357**

In [None]:
max_d=400
plt.figure(figsize=(25, 10))
fancy_dendrogram(
    Z,
    truncate_mode='lastp',
    p=100,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,
    annotate_above=100,
    max_d=max_d,  # plot a horizontal cut-off line
)
plt.show()

* This shows that there should be 6 clusters.

In [None]:
last = Z[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.figure(figsize=(20,8))
plt.plot(idxs, last_rev)

acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
k = acceleration_rev.argmax() + 2  # if idx 0 is the max of this we want 2 clusters
print("clusters:", k)

### > Generate the clusters

In [None]:
from scipy.cluster.hierarchy import fcluster
max_d = 400
clusters = fcluster(Z, max_d, criterion='distance')
clusters

In [None]:
Counter(clusters)

In [None]:
ged_secure['Cluster']=fcluster(Z, max_d, criterion='distance')
ged_secure = ged_secure.reset_index(drop=True)
df3 = df3.reset_index(drop=True)

In [None]:
grouped=ged_secure.groupby(['Cluster']).agg(np.mean)
grouped.drop(grouped.columns[1:6], axis=1)
#grouped

### > Let's look at the schools in each cluster

In [None]:
grouped=ged_secure[['School Name','Cluster']].groupby(['School Name'],as_index=False).agg(lambda x:x.value_counts().index[0]).groupby(['Cluster'])
aug=grouped['School Name'].apply(lambda x: sorted(set(x)))[1]
nov=grouped['School Name'].apply(lambda x: sorted(set(x)))[2]
apr=grouped['School Name'].apply(lambda x: sorted(set(x)))[3]
mar=grouped['School Name'].apply(lambda x: sorted(set(x)))[4]
Sep=grouped['School Name'].apply(lambda x: sorted(set(x)))[5]

z={'School Name':[aug,nov,apr,mar,Sep], 'Cluster': [1,2,3,4,5]}
z=pd.DataFrame(z)
z.style.set_properties(**{'background-color': 'black',
                           'color': 'lawngreen','border-color': 'white'})

## 3.2 PCA

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(df3)
print('Standardized feature matrix \n%s' %x)

In [None]:
cov_mat = np.cov(x.T)

eig_vals, eig_vecs = np.linalg.eig(cov_mat)


In [None]:
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low, comparing eigenvalues only
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])

In [None]:
import plotly.offline as py
from plotly.graph_objs import *
# Offline mode renders inside the notebook, so no plotly account credentials are needed
py.init_notebook_mode(connected=True)

tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

trace1 = Bar(
        x=['PC %s' %i for i in range(1,16)],
        y=var_exp,
        showlegend=False)

trace2 = Scatter(
        x=['PC %s' %i for i in range(1,16)], 
        y=cum_var_exp,
        name='cumulative explained variance')

# A plain list of traces and a dict for the axis replace the deprecated Data and YAxis classes
data = [trace1, trace2]

layout=Layout(
        yaxis=dict(title='Explained variance in percent'),
        title='Explained variance by different principal components')

fig = Figure(data=data, layout=layout)
py.iplot(fig)

**A total of 15 principal components account for more than 98% of the variance in the data.**
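
As a minimal sketch of how such a statement can be checked, the snippet below computes the cumulative explained variance from the eigenvalues of the covariance matrix and returns the smallest number of components that reaches a chosen threshold. The function name `components_for_variance` and the toy data are illustrative assumptions and are not part of this kernel.

```python
import numpy as np

def components_for_variance(data, threshold=0.98):
    """Return (k, ratios): the smallest number of principal components whose
    cumulative explained-variance ratio reaches `threshold`, and all ratios."""
    # Standardize every column, as StandardScaler does
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    cov = np.cov(standardized.T)
    # eigvalsh is meant for symmetric matrices and returns real eigenvalues
    # in ascending order, so reverse them to get the largest first
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]
    ratios = np.cumsum(eigenvalues) / eigenvalues.sum()
    # First position where the cumulative ratio is >= threshold
    k = int(np.searchsorted(ratios, threshold)) + 1
    return min(k, len(ratios)), ratios

# Toy data (assumption): the third column is almost a copy of the first,
# so two components should carry nearly all of the variance
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
toy = np.column_stack([base[:, 0], base[:, 1],
                       base[:, 0] + 0.01 * rng.normal(size=200)])
k, ratios = components_for_variance(toy, threshold=0.98)
print(k, np.round(ratios, 3))
```

Applied to `df3`, the same function would report how many components are needed for any threshold, without reading the value off the chart.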

### > PFA (Principal Feature Analysis): Obtaining the most important features 

> PCA has the disadvantage that measurements from all the original features are used in the projection to the lower-dimensional space. PFA, on the other hand, performs dimensionality reduction by choosing a subset of the original features that contains most of the essential information, using the same criteria as PCA.

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(df3)

In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from collections import defaultdict
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q
        self.n_features = n_features

    def fit(self, X):
        if not self.q:
            self.q = X.shape[1]

        sc = StandardScaler()
        X = sc.fit_transform(X)

        pca = PCA(n_components=self.q).fit(X)
        A_q = pca.components_.T

        kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_

        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
            dists[c].append((i, dist))

        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]

In [None]:
import numpy as np


pfa = PFA(n_features=15)
pfa.fit(x)

# To get the transformed matrix
X = pfa.features_

# To get the column indices of the kept features
column_indices = pfa.indices_
column_indices

### > Most important features

In [None]:
df3.columns[column_indices].values

# ---------------------------------------------------------------------------------------------------------------

## 4. Obtaining the feature importance from training

* Let us formulate a few basic assumptions based on our objective. We want to rank schools so that outreach services, which aim to encourage more students to participate, reach the schools at the top of the ranking first.

* Henceforth, we shall consider the following as our target (Y) values. The other features are our X values.

* Year 2016 Ratio of number of students who appear for the test to the number of students who are enrolled 
* Year 2016 Ratio of number of students who register for the test to the number of students who are enrolled
* Year 2016 Ratio of number of students who appear for the test to the number of students who have registered
* Difference in ratio of number of students who appear for the test to the number of students who are enrolled (2013-2016)
* Difference in the ratio of number of students who register for the test to the number of students who are enrolled (2013-2016)
* Difference in ratio of number of students who appear for the test to the number of students who registered for the test (2013-2016)
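
The three 2016 ratios and their changes over time can be sketched as below. This is only an illustration: the function names are hypothetical, the mapping to the `PEratio`, `REratio` and `PRratio` columns follows the section headings, and taking the change as the later ratio minus the earlier one is an assumption.

```python
def participation_ratios(enrolled, registered, tested):
    """Return (PEratio, REratio, PRratio) for one school and one year.

    A ratio is None when its denominator is zero, e.g. PRratio for a
    school where nobody registered."""
    def safe_div(numerator, denominator):
        return numerator / denominator if denominator else None

    pe = safe_div(tested, enrolled)       # appeared for the test / enrolled
    re_ = safe_div(registered, enrolled)  # registered / enrolled
    pr = safe_div(tested, registered)     # appeared for the test / registered
    return pe, re_, pr

def ratio_change(ratio_start, ratio_end):
    """Change in a ratio between two years (assumed: end year minus start year)."""
    if ratio_start is None or ratio_end is None:
        return None
    return ratio_end - ratio_start

# Example with made-up counts
pe, re_, pr = participation_ratios(enrolled=100, registered=40, tested=30)
print(pe, re_, pr)
no_registrations = participation_ratios(enrolled=80, registered=0, tested=0)
print(no_registrations)
```

Schools without any registrations have no defined participation-to-registration ratio, which is why rows with missing values have to be dropped before that ratio is modelled.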

I will be using two methods to rank the features.   
1) Using F-Regression   
2) Using Gradient Boosting and/or Random Forest Regressor   

### 4.1 Year 2016 Ratio of number of students who appear for the test to the number of students who are enrolled 

### > train-test split

In [None]:
from sklearn.model_selection import train_test_split
features = ged_secure[
    ['Economic Need Index','School Income Estimate','Community School?','Percent Asian', 'Percent Black', 'Percent Hispanic','Percent White', 
     'Student Attendance Rate','Percent of Students Chronically Absent', 'Rigorous Instruction %','Collaborative Teachers %','Supportive Environment %', 'Effective School Leadership %',
     'Strong Family-Community Ties %','Trust %','Average ELA Proficiency','Average Math Proficiency', 'Distance', 'crime index']].values
targets = ged_secure['PEratio'].values

X_train1, X_test1, y_train1, y_test1 =train_test_split(features, targets, test_size=0.1, random_state=1)
X_train1.shape, X_test1.shape, y_train1.shape, y_test1.shape

### > Using F-Regression: 

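`f_regression` is a univariate test: it scores each feature on its own rather than building up a model feature by feature. For every feature, it does the following:

* Compute the Pearson correlation between the feature and the target
* Convert that correlation into an F-statistic
* Convert the F-statistic into a p-value

A high F-score means the feature has a strong linear relationship with the target when considered in isolation.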
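
As a rough sketch of where a per-feature F-score comes from, the snippet below derives it from the Pearson correlation r as F = r² / (1 − r²) · (n − 2), which is the univariate test that scikit-learn documents for `f_regression` with `center=True`. The helper names and the toy numbers are assumptions made for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def univariate_f_score(x, y):
    """F-statistic for a single feature x against the target y."""
    r = pearson_r(x, y)
    degrees_of_freedom = len(x) - 2  # one slope and one intercept are fitted
    return r * r / (1 - r * r) * degrees_of_freedom

target = [2, 1, 4, 3, 5]
strong_feature = [1, 2, 3, 4, 5]  # correlation with the target is 0.8
weak_feature = [5, 1, 2, 4, 3]    # correlation with the target is 0.1

f_strong = univariate_f_score(strong_feature, target)
f_weak = univariate_f_score(weak_feature, target)
print(f_strong, f_weak)
```

Because each feature is scored in isolation, two strongly correlated features can both receive a high F-score even though they carry largely the same information.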

In [None]:
from sklearn.feature_selection import f_regression
f_regression(X_train1, y_train1, center=True)

In [None]:
a=f_regression(X_train1, y_train1, center=True)[0]
plt.figure(figsize=[15,5])
plt.bar(range(len(a)), a,width=0.5)

import math
xint = range(0, len(a))
plt.xticks(xint)
plt.show()

**Thus, the following features (those with a high F-score) are the strongest predictors of our target variable, in order of precedence:**

**1) Student Attendance Rate**
**2) Strong Family-Community Ties %**
**3) Percent of Students Chronically Absent**
**4) crime index**
**5) Economic Need Index**
**6) Percent Black**
**7) Percent Hispanic**
**8) School Income Estimate**



### > Using Gradient Boosting and/or Random Forest Regressor

Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity used to select the split points or another more specific error function. The feature importances are then averaged across all of the decision trees within the model.

Random forest consists of a number of decision trees. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set.
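
The contribution of a single split to a feature's importance can be sketched as follows. This is a simplified illustration for a regression tree that uses variance as the impurity measure; the function names and the toy values are assumptions, and it is not scikit-learn's actual implementation.

```python
def variance(values):
    """Population variance, used here as the impurity of a node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity removed by one split, weighted by the share of all
    training samples that reach the split node."""
    n_node = len(parent)
    child_impurity = (len(left) / n_node) * variance(left) \
                   + (len(right) / n_node) * variance(right)
    return (n_node / n_total) * (variance(parent) - child_impurity)

parent = [1, 1, 5, 5]  # target values reaching the node

# A split that separates low from high targets removes all impurity
good_split = weighted_impurity_decrease(parent, [1, 1], [5, 5], n_total=4)
# A split that leaves both children as mixed as the parent removes none
bad_split = weighted_impurity_decrease(parent, [1, 5], [1, 5], n_total=4)
print(good_split, bad_split)
```

A feature's importance within one tree is the sum of these decreases over every node that splits on it, and the ensemble importance is the average of that quantity over all trees.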

In [None]:
from sklearn.metrics import mean_squared_error, median_absolute_error, mean_absolute_error
from sklearn.metrics import r2_score, explained_variance_score

def regression(regressor, x_train, x_test, y_train):
    reg = regressor
    reg.fit(x_train, y_train)
    
    y_train_reg = reg.predict(x_train)
    y_test_reg = reg.predict(x_test)
    
    return y_train_reg, y_test_reg

def scores(regressor, y_train, y_test, y_train_reg, y_test_reg):
    print("_______________________________________")
    print(regressor)
    print("_______________________________________")
    print("R2 score. Train: ", r2_score(y_train, y_train_reg))
    print("R2 score. Test: ", r2_score(y_test, y_test_reg))
    print("---------")
    print("MSE (Train): ", mean_squared_error(y_train, y_train_reg))
    print("MSE Test: ", mean_squared_error(y_test, y_test_reg))
    print("---------")
    print("MAE (Train): ", mean_absolute_error(y_train, y_train_reg))
    print("MAE (Test): ", mean_absolute_error(y_test, y_test_reg))
    print("_______________________________________")

In [None]:
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
#n_estimators=The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.
y_train_gbr1, y_test_gbr1 =regression(GradientBoostingRegressor(max_depth=10, n_estimators=100), 
           X_train1, X_test1, y_train1)

scores('Gradient Boosting Regressor \nratio of no. of students who take the test to the no. of students who are enrolled', 
       y_train1, y_test1, y_train_gbr1, y_test_gbr1)

In [None]:
# n_estimators=The number of trees in the forest
y_train_rfr1, y_test_rfr1 =regression(RandomForestRegressor(n_estimators=22), X_train1, X_test1, y_train1)

scores('Random Forest Regressor \nratio',y_train1, y_test1, y_train_rfr1, y_test_rfr1)

In [None]:
# plot feature importances
model1=GradientBoostingRegressor(max_depth=7, n_estimators=100).fit(X_train1, y_train1)
model2=RandomForestRegressor(n_estimators=22).fit(X_train1, y_train1)
plt.figure(figsize=[15,5])
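# Explicit colors so that the bars match the legend patches below;
# Random Forest importances are drawn downwards to mirror Gradient Boosting
plt.bar(range(len(model1.feature_importances_)), model1.feature_importances_, width=0.5, color='b')
plt.bar(range(len(model2.feature_importances_)), -model2.feature_importances_, width=0.5, color='g')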

import math
import matplotlib.patches as mpatches
xint = range(0, len(model1.feature_importances_))
plt.xticks(xint)
blue_patch = mpatches.Patch(color='b', label='Gradient Boosting')
green_patch = mpatches.Patch(color='g', label='Random Forest')
plt.legend(handles=[blue_patch,green_patch])
plt.show()

Thus, the following features (those with high importance under either method) are the strongest predictors of our target variable:
**'Student Attendance Rate','Percent of Students Chronically Absent','Percent White','Economic Need Index','Distance','crime index','Average ELA Proficiency','Percent Black','Percent Hispanic'**.


### 4.2 Year 2016 Ratio of number of students who register for the test to the number of students who are enrolled

### > train-test split

In [None]:
targets = ged_secure['REratio'].values
features = ged_secure[
    ['Economic Need Index','School Income Estimate','Community School?','Percent Asian', 'Percent Black', 'Percent Hispanic','Percent White', 
     'Student Attendance Rate','Percent of Students Chronically Absent', 'Rigorous Instruction %','Collaborative Teachers %','Supportive Environment %', 'Effective School Leadership %',
     'Strong Family-Community Ties %','Trust %','Average ELA Proficiency','Average Math Proficiency', 'Distance', 'crime index']].values
X_train1, X_test1, y_train1, y_test1 =train_test_split(features, targets, test_size=0.1, random_state=1)
X_train1.shape, X_test1.shape, y_train1.shape, y_test1.shape

### > Using F-Regression:

In [None]:
from sklearn.feature_selection import f_regression
f_regression(X_train1, y_train1, center=True)

In [None]:
a=f_regression(X_train1, y_train1, center=True)[0]
plt.figure(figsize=[15,5])
plt.bar(range(len(a)), a,width=0.5)

import math
xint = range(0, len(a))
plt.xticks(xint)
plt.show()

**Thus, the following features (those with a high F-score) are the strongest predictors of our target variable, in order of precedence:**

**1) Distance**
**2) Percent Hispanic**
**3) Percent Black**
**4) Supportive Environment %**
**5) Economic Need Index**
**6) Collaborative Teachers %**
**7) Student Attendance Rate**
**8) crime index**
**9) Percent White**

### > Using Gradient Boosting and/or Random Forest Regressor
Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity used to select the split points or another more specific error function. The feature importances are then averaged across all of the decision trees within the model.

Random forest consists of a number of decision trees. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set.

In [None]:
y_train_gbr1, y_test_gbr1 =regression(GradientBoostingRegressor(max_depth=5, n_estimators=100), 
           X_train1, X_test1, y_train1)

scores('Gradient Boosting Regressor \nratio of no. of students who register for the test to the no. of students who are enrolled', 
       y_train1, y_test1, y_train_gbr1, y_test_gbr1)

In [None]:
# n_estimators=The number of trees in the forest
y_train_rfr1, y_test_rfr1 =regression(RandomForestRegressor(n_estimators=25), 
           X_train1, X_test1, y_train1)

scores('Random Forest Regressor \nratio',y_train1, y_test1, y_train_rfr1, y_test_rfr1)

In [None]:
# plot feature importances

model1=GradientBoostingRegressor(max_depth=7, n_estimators=100).fit(X_train1, y_train1)
model2=RandomForestRegressor(n_estimators=22).fit(X_train1, y_train1)
plt.figure(figsize=[15,5])
# Explicit colors so that the bars match the legend patches below
plt.bar(range(len(model1.feature_importances_)), model1.feature_importances_, width=0.5, color='b')
plt.bar(range(len(model2.feature_importances_)), -model2.feature_importances_, width=0.5, color='g')

import math
xint = range(0, len(model1.feature_importances_))
plt.xticks(xint)
blue_patch = mpatches.Patch(color='b', label='Gradient Boosting')
green_patch = mpatches.Patch(color='g', label='Random Forest')
plt.legend(handles=[blue_patch,green_patch])
plt.show()

Thus, the following features (those with high importance under either method) are the strongest predictors of our target variable:
**'Distance','Percent Black','Percent Hispanic','Economic Need Index','Strong Family-Community Ties %','Trust %'**.

### 4.3 Year 2016 Ratio of number of students who appear for the test to the number of students who are registered for the test

### > train-test split

In [None]:
ged_secure1=ged_secure.dropna()
targets = ged_secure1['PRratio'].values
features = ged_secure1[
    ['Economic Need Index','School Income Estimate','Community School?','Percent Asian', 'Percent Black', 'Percent Hispanic','Percent White', 
     'Student Attendance Rate','Percent of Students Chronically Absent', 'Rigorous Instruction %','Collaborative Teachers %','Supportive Environment %', 'Effective School Leadership %',
     'Strong Family-Community Ties %','Trust %','Average ELA Proficiency','Average Math Proficiency', 'Distance', 'crime index']].values
X_train1, X_test1, y_train1, y_test1 =train_test_split(features, targets, test_size=0.1, random_state=1)
X_train1.shape, X_test1.shape, y_train1.shape, y_test1.shape

### > Using F-Regression:

In [None]:
from sklearn.feature_selection import f_regression
f_regression(X_train1, y_train1, center=True)

In [None]:
a=f_regression(X_train1, y_train1, center=True)[0]
plt.figure(figsize=[15,5])
plt.bar(range(len(a)), a,width=0.5)

import math
xint = range(0, len(a))
plt.xticks(xint)
plt.show()

**Thus, the following features (those with a high F-score) are the strongest predictors of our target variable, in order of precedence:**

**1) Distance**
**2) Percent Hispanic**
**3) Percent Black**
**4) Economic Need Index**
**5) Average Math Proficiency**

### > Using Gradient Boosting and/or Random Forest Regressor
Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity used to select the split points or another more specific error function. The feature importances are then averaged across all of the decision trees within the model.

Random forest consists of a number of decision trees. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set.


In [None]:
y_train_gbr1, y_test_gbr1 =regression(GradientBoostingRegressor(max_depth=5, n_estimators=100), 
           X_train1, X_test1, y_train1)

scores('Gradient Boosting Regressor \nratio of no. of students who take the test to the no. of students who registered for the test', 
       y_train1, y_test1, y_train_gbr1, y_test_gbr1)


In [None]:
# n_estimators=The number of trees in the forest
y_train_rfr1, y_test_rfr1 =regression(RandomForestRegressor(n_estimators=25), 
           X_train1, X_test1, y_train1)

scores('Random Forest Regressor \nratio', 
       y_train1, y_test1, y_train_rfr1, y_test_rfr1)

In [None]:
# plot feature importances
model1=GradientBoostingRegressor(max_depth=7, n_estimators=100).fit(X_train1, y_train1)
model2=RandomForestRegressor(n_estimators=22).fit(X_train1, y_train1)
plt.figure(figsize=[15,5])
# Explicit colors so that the bars match the legend patches below
plt.bar(range(len(model1.feature_importances_)), model1.feature_importances_, width=0.5, color='b')
plt.bar(range(len(model2.feature_importances_)), -model2.feature_importances_, width=0.5, color='g')

import math
xint = range(0, len(model1.feature_importances_))
plt.xticks(xint)
import matplotlib.patches as mpatches

blue_patch = mpatches.Patch(color='b', label='Gradient Boosting')
green_patch = mpatches.Patch(color='g', label='Random Forest')
plt.legend(handles=[blue_patch,green_patch])
plt.show()

Thus, the following features (those with high importance under either method) are the strongest predictors of our target variable: **Economic Need Index, Percent Hispanic, Percent Black, Distance**.


### > From the above methods, we conclude that the most important features are as follows:

* **Economic Need Index**
* **Student Attendance Rate**
* **Distance**
* **crime index**
* **Percent Black**
* **Percent Hispanic**
* **Percent White (plays a major role as minority)**

# ---------------------------------------------------------------------------------------------------------------

## 5. Obtaining feature weights

>Before proceeding further, I'll standardize all features to bring them onto the same scale, so that no feature overpowers the others. Each feature has its mean subtracted and is then divided by its standard deviation.

In [None]:
features = ged_secure1[
    ['Economic Need Index','Community School?','Percent Black', 'Percent Hispanic','Percent White', 
     'Student Attendance Rate','Distance', 'crime index']]
min_vec=features.min(axis=0)
max_vec=features.max(axis=0)

In [None]:
features_normalized=(features.sub(features.mean(axis=0), axis=1))/ features.std(axis=0)
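
This kernel uses two different scalings: z-score standardization in the cell above and min-max scaling in the ranking step. The sketch below contrasts the two on a toy column. The helper names are hypothetical, and `statistics.stdev` is used because it is the sample standard deviation, which is also the default of pandas' `std`.

```python
import statistics

def z_score(column):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)
    return [(value - mean) / std for value in column]

def min_max(column):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    low, high = min(column), max(column)
    return [(value - low) / (high - low) for value in column]

column = [2, 4, 6]
standardized = z_score(column)
rescaled = min_max(column)
print(standardized)
print(rescaled)
```

Standardized values are centred on zero and can be negative, whereas min-max values always lie between 0 and 1, which keeps every term of a weighted sum non-negative.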

In [None]:
targets = ged_secure1['PEratio'].values
a=f_regression(features_normalized.values,targets, center=True)[0]
f_regression(features_normalized.values,targets, center=True)[0]

In [None]:
targets = ged_secure1['PRratio'].values
b=f_regression(features_normalized.values,targets, center=True)[0]
f_regression(features_normalized.values,targets, center=True)[0]

In [None]:
targets = ged_secure1['REratio'].values
c=f_regression(features_normalized.values,targets, center=True)[0]
f_regression(features_normalized.values,targets, center=True)[0]

### > I shall consider the average of all the F-scores as weights of the features.

In [None]:
weights=(a+b+c)/3

# ---------------------------------------------------------------------------------------------------------------

## 6. Ranking the schools

### > Normalizing the whole dataset

In [None]:
# Columns that enter the score, in the same order as the weights
score_cols = ['Economic Need Index','Community School?','Percent Black', 'Percent Hispanic','Percent White', 
     'Student Attendance Rate','Distance', 'crime index']
# .copy() so that adding the 'Score' column later does not write into a slice of df
features = df[['School Name', 'Location Code', 'District', 'Latitude', 'Longitude',
       'Address (Full)', 'City', 'Zip'] + score_cols].copy()
min_vec=features[score_cols].min(axis=0)
max_vec=features[score_cols].max(axis=0)

# Min-max scaling: every scoring feature ends up in the range [0, 1]
features_normalized=(features[score_cols].sub(min_vec, axis=1))/ (max_vec-min_vec)

### > Weighted sum of features

In [None]:
weighed_features=features_normalized*weights
features['Score']=weighed_features.sum(axis=1)
features=features.sort_values('Score', ascending=False).drop_duplicates('School Name')
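
The scoring step can be sketched on a toy example as follows: each feature is min-max scaled, multiplied by its weight, summed per school, and the schools are sorted by the resulting score. The school names, feature values and weights below are made up for illustration, and the helper names are hypothetical.

```python
def min_max_scale(column):
    """Rescale a column to [0, 1]; a constant column maps to zeros."""
    low, high = min(column), max(column)
    return [(v - low) / (high - low) if high > low else 0.0 for v in column]

def rank_by_weighted_score(names, feature_columns, weights):
    """Return (name, score) pairs sorted from the highest score to the lowest."""
    scaled = {name: min_max_scale(col) for name, col in feature_columns.items()}
    scores = [
        sum(weights[feature] * scaled[feature][i] for feature in weights)
        for i in range(len(names))
    ]
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

names = ['School A', 'School B', 'School C']
feature_columns = {
    'Economic Need Index': [0.9, 0.5, 0.1],
    'Distance': [2.0, 8.0, 5.0],
}
weights = {'Economic Need Index': 3.0, 'Distance': 1.0}

ranking = rank_by_weighted_score(names, feature_columns, weights)
for name, score in ranking:
    print(name, round(score, 2))
```

Since all weights are non-negative, a school can only gain score from a feature, so the ranking implicitly treats a larger value of every feature as a greater need for outreach.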

In [None]:
features1=features[:10]
features1

### > Top 10 Schools which require intervention

In [None]:
map_1 = folium.Map(location=[40.755048, -73.926963],
                   zoom_start=9.5,
                   tiles='cartodbdark_matter')
# Mark the top 10 schools; the popup shows the school name
for _, row in features1.iterrows():
    folium.Marker([row['Latitude'], row['Longitude']], popup=row['School Name']).add_to(map_1)

# Plot every school as a small circle colored by its cluster.
# `step` is the StepColormap for the cluster colors and must already be defined when this cell runs.
for lat, lon, cluster in zip(df['Latitude'], df['Longitude'], df['Cluster']):
    folium.CircleMarker([lat, lon], color=step(cluster), fill=True, radius=0.9).add_to(map_1)

map_1

**The top 10 schools are shown together with the earlier grouping generated by hierarchical clustering.**

**Click on a marker to see the name of the school.**

In [None]:
z=ged[['Program Site name','Latitude','Longitude','Borough']].dropna()

import branca.colormap as cm
import folium
from folium import plugins

step = cm.StepColormap(
    ['aqua','yellow','red'],
    vmin=0.5, vmax=3.5,
    index=[0.5,1.5,2.5],
    caption='step'
)
    
step

* **Aqua: Cluster 1**
* **Yellow: Cluster 2**
* **Red: Cluster 3**

# ---------------------------------------------------------------------------------------------------------------

## 7. Conclusion:
As is evident from the map, most of the schools belong to Cluster 2, which has the highest percentages of Black and Hispanic students (40.8% and 50%, as deduced earlier). Moreover, the crime index is higher in the areas of Cluster 2 (1.06). Student Attendance Rate is lowest for Cluster 2 (91.34%), with a higher percentage of students chronically absent (25.77%). Average ELA and Math Proficiency are lowest for Cluster 2, and fewer students score 4 in Grade 8 in Cluster 2. Cluster 2 also has the highest Economic Need Index.

Thus this is consistent with our clustering results.

# ............................................................................................................................

# This kernel is still under development. Feel free to post your suggestions. Thank you.

# ................................................................................................................................