![Python_logo](https://www.python.org/static/community_logos/python-logo-master-v3-TM.png)


# **Cortex Game: Round1--Amount**

> Before playing the game, you need to connect to SASPy first.
>
>> If it is your first time, please follow the 4 steps mentioned below!

***
## **Connect to SASPy**

**1- Make sure that your Python version is 3.3 or higher**

In [1]:
from platform import python_version
print(python_version())
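The printed version string still has to be compared against the 3.3 minimum by eye. A minimal sketch of a programmatic check follows; the helper name `python_is_supported` and the constant `MIN_VERSION` are hypothetical and not part of SASPy. It compares integer tuples, because comparing version strings would rank "3.10" below "3.3".

```python
import sys

MIN_VERSION = (3, 3)  # minimum stated in step 1 of this notebook

def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) pair meets the minimum (hypothetical helper)."""
    return tuple(version_info[:2]) >= tuple(minimum)

# Check the interpreter that is running this notebook
print(python_is_supported())
```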

**2- Install SASPy**

In [2]:
pip install saspy

Note: you may need to restart the kernel to use updated packages.





**3- Make sure that the configuration file "sascfg_personal.py" is correctly created**

In [3]:
import saspy, os
print(saspy.__file__.replace('__init__.py', 'sascfg_personal.py'))

C:\Users\leGalane\anaconda3\lib\site-packages\saspy\sascfg_personal.py
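The cell above only prints where the configuration file is expected; it does not confirm that the file exists. A small sketch of that extra check, using `pathlib`, is shown here. The helper name `personal_config_path` and the install location are hypothetical; in the notebook you would pass `saspy.__file__` instead.

```python
from pathlib import Path

def personal_config_path(package_init_file):
    """Path of sascfg_personal.py next to a package's __init__.py (hypothetical helper)."""
    return Path(package_init_file).with_name("sascfg_personal.py")

# Hypothetical install location, used only for illustration.
example = personal_config_path("/opt/python/site-packages/saspy/__init__.py")

# is_file() is False when the configuration file has not been created yet
print(example.name, example.is_file())
```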


**4- Establish Connection (Need to do this step each time you use SASPy)**

In [4]:
import saspy
sas_session = saspy.SASsession()
sas_session

Using SAS Config named: oda
Error trying to read authinfo file:C:\Users\leGalane\_authinfo
[Errno 2] No such file or directory: 'C:\\Users\\leGalane\\_authinfo'
Did not find key oda in authinfo file:C:\Users\leGalane\_authinfo

Please enter the OMR user id: a01651812@tec.mx
Please enter the password for OMR user : ········
SAS Connection established. Subprocess id is 65836



Access Method         = IOM
SAS Config name       = oda
SAS Config file       = C:\Users\leGalane\anaconda3\lib\site-packages\saspy\sascfg_personal.py
WORK Path             = /saswork/SAS_work84AA0000E7CA_odaws01-usw2-2.oda.sas.com/SAS_work0B330000E7CA_odaws01-usw2-2.oda.sas.com/
SAS Version           = 9.04.01M6P11072018
SASPy Version         = 4.4.1
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = utf-8
Python Encoding value = utf-8
SAS process Pid value = 59338


***
## Connect to Cortex Data Sets

Load Cortex datasets from SAS Studio

In [5]:
%%SAS sas_session
libname cortex '~/my_shared_file_links/u39842936/Cortex Data Sets';


## Transform cloud SAS dataset to Python dataframe (pandas)


> **For reference**:

> 1. [Pandas library](https://pandas.pydata.org/docs/user_guide/index.html)

> 2. [sklearn.model_selection for data partition](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)


In [6]:
import pandas as pd

data1 = sas_session.sasdata2dataframe(
table='hist',
libref='cortex'
)

data2 = sas_session.sasdata2dataframe(
table='target_rd1',
libref='cortex'
)

## Merge the Data

In [7]:
data_merge = pd.merge(data1, data2, on=["ID"],how="right")
data_merge.sample(2)

#data_merge.head()


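| | ID | LastName | FirstName | Woman | Age | Salary | Education | City | SeniorList | NbActivities | ... | Recency | Frequency | Seniority | TotalGift | MinGift | MaxGift | GaveLastYear | AmtLastYear | GaveThisYear | AmtThisYear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 815264 | 2815265.0 | SIMMONS | WILLIAM | 0.0 | 44.0 | 71100.0 | University / College | Suburban | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 855645 | 2855646.0 | HERNANDEZ | MARY | 1.0 | 62.0 | 25400.0 | University / College | Downtown | 6.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |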
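With `how="right"`, every row of the second table is kept, and rows without a match in the first table get missing values in the columns that come from it. The sketch below illustrates this on two made-up tables (the data and the names `hist` and `target` are assumptions for illustration); the `indicator=True` option of `pd.merge` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Made-up stand-ins for the historical and target tables
hist = pd.DataFrame({"ID": [1, 2, 3], "Age": [44, 62, 30]})
target = pd.DataFrame({"ID": [2, 3, 4], "AmtThisYear": [0.0, 25.0, 10.0]})

# Right merge: keep every ID from `target`, matched or not
merged = pd.merge(hist, target, on=["ID"], how="right", indicator=True)
print(merged)

# Rows that exist only in the right-hand table have no history columns
unmatched = merged[merged["_merge"] == "right_only"]
```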


## Treat Missing Values

> Please be aware that deleting all rows with missing values can induce a selection bias.
> Some missing values are very informative. For example, when MinGift is missing, it means that the donor never gave in the past 10 years (up to but excluding last year). Instead of deleting this information, replacing the missing value with 0 is more appropriate.

> A good understanding of the business case and the data can help you come up with more appropriate strategies to deal with missing values.


In [8]:
# Here we replace missing MinGift values with 0.
# Apply whatever strategy you consider reasonable to the other variables.

data_merge['MinGift'] = data_merge['MinGift'].fillna(0)

data_merge.sample(3)

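| | ID | LastName | FirstName | Woman | Age | Salary | Education | City | SeniorList | NbActivities | ... | Recency | Frequency | Seniority | TotalGift | MinGift | MaxGift | GaveLastYear | AmtLastYear | GaveThisYear | AmtThisYear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 148524 | 2148525.0 | WOOLDRIDGE | SCOTT | 0.0 | 63.0 | 7300.0 | University / College | Downtown | 8.0 | 5.0 | ... | 5.0 | 3.0 | 7.0 | 50.0 | 10.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 809191 | 2809192.0 | SIMON | MAUREEN | 1.0 | 43.0 | 32400.0 | University / College | Rural | 7.0 | 0.0 | ... | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 603943 | 2603944.0 | ALLEN | REBECCA | 1.0 | 32.0 | 142900.0 | University / College | Suburban | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |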
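Filling MinGift with 0 keeps the rows, but it also erases the fact that the value was missing. One possible variation, sketched here on made-up data, is to record the missingness in a flag column before filling. The column name `NeverGave` is a hypothetical choice, not a variable in the Cortex data sets.

```python
import numpy as np
import pandas as pd

# Made-up donors: the last two never gave, so MinGift is missing
donors = pd.DataFrame({"ID": [1, 2, 3], "MinGift": [10.0, np.nan, np.nan]})

# Record the information carried by the missing value first ...
donors["NeverGave"] = donors["MinGift"].isna().astype(int)

# ... then replace the missing value with 0
donors["MinGift"] = donors["MinGift"].fillna(0)
print(donors)
```

The flag can then be offered to a model as an extra predictor alongside the filled column.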


## Data Partition

In [9]:
# The code below illustrates how to split the data into training and validation samples.
# You could use another library or a built-in function to perform the sampling.

from sklearn.model_selection import train_test_split
train, validation = train_test_split(data_merge, test_size=0.4, random_state=12345) 

#train.head()
train.sample(2)

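| | ID | LastName | FirstName | Woman | Age | Salary | Education | City | SeniorList | NbActivities | ... | Recency | Frequency | Seniority | TotalGift | MinGift | MaxGift | GaveLastYear | AmtLastYear | GaveThisYear | AmtThisYear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 154109 | 2154110.0 | PORTILLO | VERNA | 1.0 | 53.0 | 47100.0 | University / College | Rural | 6.0 | 0.0 | ... | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 104521 | 2104522.0 | PADILLA | JOHN | 0.0 | 29.0 | 119400.0 | University / College | City | 4.0 | 0.0 | ... | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |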
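To make the effect of `test_size=0.4` and `random_state` concrete, here is a small sketch on a made-up ten-row table (the data and the names `toy`, `train_toy`, `valid_toy` are assumptions for illustration). It shows the 60/40 split, that the two samples do not overlap, and that a fixed seed reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table with ten donors
toy = pd.DataFrame({"ID": range(10), "AmtThisYear": [0.0] * 8 + [20.0, 50.0]})

# 40% of the rows go to validation, the remaining 60% to training
train_toy, valid_toy = train_test_split(toy, test_size=0.4, random_state=12345)
print(len(train_toy), len(valid_toy))

# The same random_state reproduces the same split
train_again, valid_again = train_test_split(toy, test_size=0.4, random_state=12345)
```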


## Prebuilt Models
***

### **Linear Regression Model**


> The [scikit-learn library](https://scikit-learn.org/stable/index.html) offers more advanced models.


In [10]:
from sklearn import linear_model

# Note: these are pandas objects (DataFrame / Series), not NumPy arrays
X_train = train[['Age', 'Salary','MinGift', 'AmtLastYear','Woman', 'NbActivities' ]] 
Y_train = train['AmtThisYear']
X_valid = validation[['Age', 'Salary','MinGift', 'AmtLastYear','Woman', 'NbActivities']] 
Y_valid = validation['AmtThisYear']

regr = linear_model.LinearRegression()
regr.fit(X_train,Y_train)
regr_predict=regr.predict(X_valid)

In [11]:
# You can use other evaluation criteria

import numpy as np
from sklearn import metrics
#MAE
print(metrics.mean_absolute_error(Y_valid,regr_predict))
#MSE
print(metrics.mean_squared_error(Y_valid,regr_predict))
#RMSE
print(np.sqrt(metrics.mean_squared_error(Y_valid,regr_predict)))

13.288003475714971
7607.206593588201
87.21930172609845
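The three numbers above are the MAE, the MSE, and the RMSE, in that order. The sketch below computes the same three criteria by hand on made-up values (an assumption for illustration, not the game data) to show how they relate: one large miss barely moves the MAE but dominates the MSE and RMSE, which is one plausible reading of an RMSE far above the MAE.

```python
import numpy as np

# Made-up example: three non-donors and one large gift, all predicted as 5
y_true = np.array([0.0, 0.0, 0.0, 100.0])
y_pred = np.array([5.0, 5.0, 5.0, 5.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the units of the target

print(mae, mse, rmse)
```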


### **Regression Tree Model**

In [12]:
from sklearn.tree import DecisionTreeRegressor

X_train = train[['Age', 'Salary','MinGift', 'AmtLastYear','Woman', 'NbActivities']] 
Y_train = train['AmtThisYear']
X_valid = validation[['Age', 'Salary','MinGift', 'AmtLastYear','Woman', 'NbActivities']] 
Y_valid = validation['AmtThisYear']

DT_model = DecisionTreeRegressor(max_depth=5).fit(X_train,Y_train)

DT_predict = DT_model.predict(X_valid)  # Predictions on the validation data


In [13]:
#you can change the criteria
#MAE
print(metrics.mean_absolute_error(Y_valid,DT_predict))
#MSE
print(metrics.mean_squared_error(Y_valid,DT_predict))
#RMSE
print(np.sqrt(metrics.mean_squared_error(Y_valid,DT_predict)))

13.269056773933272
7611.489839474201
87.24385273172088
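The tree and the linear regression score almost identically, so a reference point helps to judge whether either model adds much. One option, sketched here on synthetic data (an assumption for illustration), is scikit-learn's `DummyRegressor`, which always predicts the training mean. In the notebook you would fit it on `X_train`, `Y_train` and evaluate it on `X_valid`, `Y_valid`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data: the target depends on the first feature only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

# Baseline that ignores the features and predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_toy[:150], y_toy[:150])
baseline_mse = mean_squared_error(y_toy[150:], baseline.predict(X_toy[150:]))
print(baseline_mse)
```

A model whose validation MSE is not clearly below the baseline MSE has learned little from its predictors.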


### **Other models may also be helpful for this game**

Reference: https://scikit-learn.org/stable/supervised_learning.html

***


# My Models For Prediction

In [14]:
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

In [15]:
train = train[['Age', 'Salary','MinGift', 'AmtLastYear','Woman', 'NbActivities', 'AmtThisYear']] 
validation = validation[['Age', 'Salary','MinGift', 'AmtLastYear','Woman', 'NbActivities', 'AmtThisYear']]

In [16]:
distortions = []
inertias = []
mapping1 = {}  # k -> distortion
mapping2 = {}  # k -> inertia
K = range(1, 10)

for k in K:
    # Build and fit the model
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(X_train)

    # Distortion: mean Euclidean distance from each point to its nearest centroid
    distortion = np.min(cdist(X_train, kmeanModel.cluster_centers_,
                              'euclidean'), axis=1).mean()
    distortions.append(distortion)
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_

KeyboardInterrupt: 

In [None]:
plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()

In [None]:
plt.plot(K, inertias, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()
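The elbow loop above was interrupted, which suggests it is slow on the full training set. A possible workaround, sketched here on synthetic data, is to run the elbow search on a random subsample and to standardize the features first so that large-valued columns such as Salary do not dominate the Euclidean distance. The sample size, the seed, and the synthetic clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated groups of 100 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X_toy = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

# Work on a random subsample and put every feature on the same scale
sample_idx = rng.choice(len(X_toy), size=150, replace=False)
X_sample = StandardScaler().fit_transform(X_toy[sample_idx])

# Inertia for k = 1..5; the curve should flatten after the true number of groups
inertias_toy = []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_sample)
    inertias_toy.append(km.inertia_)

print(inertias_toy)
```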

In [None]:
# We use K = 4 clusters, then fit one linear regression model per cluster.
# errors='ignore' lets this cell run before the GroupLabel column exists.
kmeanModel = KMeans(n_clusters=4)
kmeanModel.fit(train.drop(columns='GroupLabel', errors='ignore'))

train["GroupLabel"] = kmeanModel.labels_
train.head()

In [None]:
from sklearn import linear_model

In [None]:
X_train = train[['Age', 'Salary','MinGift', 'AmtLastYear','Woman', 'NbActivities', 'GroupLabel']] 
Y_train = train[['AmtThisYear', "GroupLabel"]]

In [None]:
lineal_regressions = {}

for k in range(4):
    regr = linear_model.LinearRegression()
    X_train_k = X_train[X_train["GroupLabel"] == k].drop('GroupLabel', axis=1)
    Y_train_k = Y_train[Y_train["GroupLabel"] == k].drop('GroupLabel', axis=1)
    #print(X_train_k.head())
    
    regr.fit(X_train_k,Y_train_k)
    lineal_regressions[f'Group{k}'] = regr

In [None]:
validation

In [None]:
validation["GroupLabel"] = kmeanModel.predict(validation.drop(columns="GroupLabel", errors="ignore"))
validation.head()

In [None]:
# .copy() so that adding a column later does not write into a view of validation
X_valid = validation[['Age', 'Salary','MinGift', 'AmtLastYear','Woman', 'NbActivities', 'GroupLabel']].copy()
Y_valid = validation['AmtThisYear']

In [None]:
X_valid["Pred"] = np.nan
X_valid

In [None]:
%%time
for k in range(4):
    # Rows of the validation set assigned to cluster k
    mask = X_valid["GroupLabel"] == k
    if not mask.any():
        continue
    preds = lineal_regressions[f"Group{k}"].predict(X_valid.loc[mask].drop(["GroupLabel", "Pred"], axis=1))
    # The models were fitted on a one-column DataFrame target, so predict returns shape (n, 1)
    X_valid.loc[mask, "Pred"] = preds.ravel()

In [None]:
X_valid.info()

In [None]:
preds_km = X_valid["Pred"]

In [None]:
import numpy as np
from sklearn import metrics
#MAE
print(metrics.mean_absolute_error(Y_valid,X_valid["Pred"]))
#MSE
print(metrics.mean_squared_error(Y_valid,X_valid["Pred"]))
#RMSE
print(np.sqrt(metrics.mean_squared_error(Y_valid,X_valid["Pred"])))
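The clustering above is fitted on every column of `train`, including `AmtThisYear`, which is not available when scoring new donors. The sketch below is one possible variant, not the method used above: it clusters on the predictors only and then fits one linear regression per cluster, so the same pipeline can be applied to unseen rows. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def fit_cluster_regressions(X, y, n_clusters, random_state=0):
    """Cluster on the features only, then fit one regression per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    models = {}
    for k in range(n_clusters):
        mask = km.labels_ == k
        models[k] = LinearRegression().fit(X[mask], y[mask])
    return km, models

def predict_cluster_regressions(km, models, X):
    """Assign each row to a cluster and predict with that cluster's model."""
    labels = km.predict(X)
    preds = np.empty(len(X))
    for k, model in models.items():
        mask = labels == k
        if mask.any():  # skip clusters with no rows in X
            preds[mask] = model.predict(X[mask])
    return preds

# Synthetic data: two groups with different linear relationships
rng = np.random.default_rng(0)
X_a = rng.normal(loc=0.0, size=(50, 1))
X_b = rng.normal(loc=20.0, size=(50, 1))
X_toy = np.vstack([X_a, X_b])
y_toy = np.concatenate([2.0 * X_a[:, 0], -1.0 * X_b[:, 0] + 5.0])

km, models = fit_cluster_regressions(X_toy, y_toy, n_clusters=2)
preds_toy = predict_cluster_regressions(km, models, X_toy)
```

Assigning predictions through a boolean mask also avoids a row-by-row loop, which matters on a validation set of this size.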

In [20]:
#Bayesian Ridge Regression
from sklearn import linear_model

# Note: these are pandas objects (DataFrame / Series), not NumPy arrays
X_train_b = train[['Age', 'Salary','MinGift', 'AmtLastYear','Woman', 'NbActivities' ]] 
Y_train_b = train['AmtThisYear']
X_valid_b = validation[['Age', 'Salary','MinGift', 'AmtLastYear','Woman', 'NbActivities']] 
Y_valid_b = validation['AmtThisYear']

# Fit and evaluate on the *_b variables defined above, so this cell does not
# depend on whatever X_train / Y_train currently hold from earlier cells.
BRR_regr = linear_model.BayesianRidge()
BRR_regr.fit(X_train_b, Y_train_b)
BRR_regr_predict = BRR_regr.predict(X_valid_b)

#MAE
print(metrics.mean_absolute_error(Y_valid_b, BRR_regr_predict))
#MSE
print(metrics.mean_squared_error(Y_valid_b, BRR_regr_predict))
#RMSE
print(np.sqrt(metrics.mean_squared_error(Y_valid_b, BRR_regr_predict)))

13.288496008686197
7607.204160981126
87.21928778074908


In [None]:
sum(regr_predict)

## Scoring New Data

### Prepare data for scoring

In [18]:
data3 = sas_session.sasdata2dataframe(
table='score_rd1',
libref='cortex'
)
data4 = sas_session.sasdata2dataframe(
table='score',
libref='cortex'
)

### Score new data based on your champion model

> Pick your champion model from the previous steps and use it to predict next year's donations.
 
> In this case, the linear regression model performed better than the regression tree based on the MSE criterion.

In [19]:
scoring_data = pd.merge(data3, data4, on=["ID"],how="right")

# Apply the same missing-value strategy to the scoring dataset.
# In this case, we only replace missing values of the MinGift variable.

scoring_data['MinGift'] = scoring_data['MinGift'].fillna(0)

scoring_data.head()

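| | ID | GaveLastYear | AmtLastYear | LastName | FirstName | Woman | Age | Salary | Education | City | SeniorList | NbActivities | Referrals | Recency | Frequency | Seniority | TotalGift | MinGift | MaxGift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000001.0 | 0.0 | 0.0 | ROMMES | RODNEY | 0.0 | 25.0 | 107200.0 | University / College | City | 2.0 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 | 1010.0 | 10.0 | 1000.0 |
| 1 | 2000002.0 | 0.0 | 0.0 | RAMIREZ | SHARON | 1.0 | 38.0 | 15800.0 | High School | Rural | 4.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | 0.0 | NaN |
| 2 | 2000003.0 | 0.0 | 0.0 | TSOSIE | KAREN | 1.0 | 37.0 | 57400.0 | University / College | Rural | 5.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 0.0 | NaN |
| 3 | 2000004.0 | 0.0 | 0.0 | LEE | MARY | 1.0 | 78.0 | 23700.0 | High School | Rural | 3.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 0.0 | NaN |
| 4 | 2000005.0 | 0.0 | 0.0 | HUMPHRES | ANGIE | 1.0 | 34.0 | 71900.0 | University / College | Rural | 8.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 0.0 | NaN |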


In [21]:
# The Bayesian ridge model (BRR_regr) is used as the champion here: its validation MSE
# (7607.204) is marginally lower than that of the linear regression (7607.207)
# and of the regression tree (7611.490).

X = scoring_data[['Age', 'Salary','MinGift', 'AmtLastYear','Woman', 'NbActivities']] 
BRR_regr_predict_end=BRR_regr.predict(X)

scoring_data['Prediction'] = BRR_regr_predict_end
scoring_data.sort_values(by=['Prediction'], inplace=True,ascending=False)
scoring_data.head()

| | ID | GaveLastYear | AmtLastYear | LastName | FirstName | Woman | Age | Salary | Education | City | SeniorList | NbActivities | Referrals | Recency | Frequency | Seniority | TotalGift | MinGift | MaxGift | Prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 420890 | 2420891.0 | 1.0 | 9000.0 | BEIL | MARGARET | 1.0 | 37.0 | 104200.0 | Elementary | Downtown | 10.0 | 5.0 | 5.0 | 7.0 | 1.0 | 7.0 | 7000.0 | 7000.0 | 7000.0 | 203.648571 |
| 631673 | 2631674.0 | 1.0 | 10000.0 | KOPPENHEFFER | JENNIFER | 1.0 | 34.0 | 186500.0 | University / College | City | 9.0 | 3.0 | 1.0 | 0.0 | 1.0 | 0.0 | 500.0 | 500.0 | 500.0 | 157.136945 |
| 334249 | 2334250.0 | 1.0 | 10000.0 | MANLEY | COLLEEN | 1.0 | 36.0 | 108300.0 | University / College | Suburban | 6.0 | 4.0 | 3.0 | 5.0 | 2.0 | 6.0 | 45.0 | 20.0 | 25.0 | 154.611308 |
| 954313 | 2954314.0 | 1.0 | 10000.0 | SANCHEZ | JADA | 1.0 | 37.0 | 222700.0 | High School | Suburban | 10.0 | 2.0 | 1.0 | 1.0 | 4.0 | 9.0 | 95.0 | 10.0 | 40.0 | 150.004245 |
| 416110 | 2416111.0 | 1.0 | 10000.0 | GOLDSTEIN | MICHELLE | 1.0 | 27.0 | 46900.0 | University / College | City | 5.0 | 3.0 | 1.0 | 1.0 | 3.0 | 3.0 | 150.0 | 50.0 | 50.0 | 149.64847 |


## Exporting Results to a CSV File

In [22]:
Result= scoring_data[['ID','Prediction']]
#Result.to_csv('Round1_Output.csv', index=False)

In [23]:
# Define your cutoff and choose a number of rows to submit to the leaderboard

NB = 10000
submission = Result.head(NB)
submission.to_csv('Round1 Output.csv', index=False)

In [None]:
# Reminder: Please note that you need only one column (the list of donors' IDs) to submit to the leaderboard.
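Since the leaderboard needs only the donor IDs, the submission can be reduced to a single column. The sketch below shows one way to do that on made-up scores (the data, `NB_TOY`, and the in-memory buffer are assumptions for illustration; the cast to integer assumes IDs should not carry a trailing `.0`). In the notebook you would start from `Result` and write to a file name instead of a buffer.

```python
import io
import pandas as pd

# Made-up scores standing in for Result[['ID', 'Prediction']]
scores = pd.DataFrame({
    "ID": [2000001.0, 2000002.0, 2000003.0],
    "Prediction": [12.5, 40.0, 3.2],
})

NB_TOY = 2  # number of donors to submit (illustrative cutoff)

# Keep the NB_TOY highest predictions, then keep only the ID column as integers
top_ids = (
    scores.sort_values("Prediction", ascending=False)
          .head(NB_TOY)[["ID"]]
          .astype(int)
)

# Write to an in-memory buffer here; pass a file path to to_csv in practice
buffer = io.StringIO()
top_ids.to_csv(buffer, index=False)
print(buffer.getvalue())
```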
