# 6.2 Regression

1. Load the Galton dataset into a Pandas dataframe?
    *  http://www.randomservices.org/random/data/Galton.html
    
2. Summarize the dataset:
    * Number of rows
    * Average height of male/female kids
    * Std deviation of male/female kids
    
3. Create a training and test dataset. The test dataset should be at least 25%.

4. Create 2 regression models: for predicting the childs height based on (i) father height and (ii) mother's height!

5. Compute the model quality parameters: $R^{2}$ and $MSE$! 

6. Create a multi-variate regression model including both the mother and father height as features! How does the $R^{2}$ change?

7. Create a Spark MLlib model for the same task!

References: 
* http://scikit-learn.org/stable/modules/linear_model.html
* http://scikit-learn.org/stable/model_selection.html
* <http:///pygot.wordpress.com/2017/03/25/simple-linear-regression-with-galton/>
* <https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#linear-regression>

In [46]:
%matplotlib inline
import csv
import requests # pip install requests for easy http request for CSV data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

In [28]:
df=pd.read_csv("http://www.randomservices.org/random/data/Galton.txt", sep="\t")
print(df.head())
# Number of rows
print("Number of rows: ",len(df.index))
# Average height of male/female kids
print("\nAvg height:\n",df.groupby('Gender')['Height'].mean())
# Std deviation of male/female kids
print("\nStddev height:\n",df.groupby('Gender')['Height'].std())

  Family  Father  Mother Gender  Height  Kids
0      1    78.5    67.0      M    73.2     4
1      1    78.5    67.0      F    69.2     4
2      1    78.5    67.0      F    69.0     4
3      1    78.5    67.0      F    69.0     4
4      2    75.5    66.5      M    73.5     4
Number of rows:  898

Avg height:
 Gender
F    64.110162
M    69.228817
Name: Height, dtype: float64

Stddev height:
 Gender
F    2.370320
M    2.631594
Name: Height, dtype: float64


In [42]:
# Create a training and test dataset. The test dataset should be at least 25%.
X_train_father, X_test_father, Y_train_father, Y_test_father = \
    train_test_split(df['Father'],df['Height'], test_size=0.25, random_state=42)
print("Father "+str(len(df))+" full = "+str(len(X_train_father))+" train + "+str(len(X_test_father))+" test")
X_train_father = X_train_father[:,np.newaxis]
X_test_father = X_test_father[:,np.newaxis]

X_train_mother, X_test_mother, Y_train_mother, Y_test_mother = \
    train_test_split(df['Mother'],df['Height'], test_size=0.25, random_state=42)
print("Mother "+str(len(df))+" full = "+str(len(X_train_mother))+" train + "+str(len(X_test_mother))+" test")
X_train_mother = X_train_mother[:,np.newaxis]
X_test_mother = X_test_mother[:,np.newaxis]


Father 898 full = 673 train + 225 test
Mother 898 full = 673 train + 225 test


In [43]:
# Create 2 regression models: for predicting the childs height based on (i) father height and (ii) mother's height!
# (i)
print("Father:")
model_father = LinearRegression(fit_intercept=True)
model_father.fit(X_train_father,Y_train_father) # X is row/col fmt, y is vector
print('Coefficient: \n', model_father.coef_)
print('Intercept: \n', model_father.intercept_)

#(ii)
print("\nMother:")
# (i)
model_mother = LinearRegression(fit_intercept=True)
model_mother.fit(X_train_mother,Y_train_mother) # X is row/col fmt, y is vector
print('Coefficient: \n', model_mother.coef_)
print('Intercept: \n', model_mother.intercept_)

Father:
Coefficient: 
 [0.42198842]
Intercept: 
 37.586619352188876

Mother:
Coefficient: 
 [0.29157075]
Intercept: 
 48.10734824917103


In [48]:
# Compute the model quality parameters: $R^{2}$ and $MSE$! 
father_pred = model_father.predict(X_test_father)
print("Father R2 score: ",r2_score(Y_test_father, father_pred))
print("Father MSE score: ",mean_squared_error(Y_test_father, father_pred))

mother_pred = model_father.predict(X_test_mother)
print("\nMother R2 score: ",r2_score(Y_test_mother, mother_pred))
print("Mother MSE score: ",mean_squared_error(Y_test_mother, mother_pred))

Father R2 score:  0.04131337596295048
Father MSE score:  11.565292440067221

Mother R2 score:  -0.28738778694023237
Mother MSE score:  15.530639383531586


In [60]:
# Create a multi-variate regression model including both the mother and father height as features! How does the $R^{2}$ change?
fulltrain, fulltest = np.split(df, [int(.75*len(df))])
x_train = fulltrain[['Father', 'Mother']]
y_train = fulltrain['Height']
x_test = fulltest[['Father', 'Mother']]
y_test = fulltest['Height']

print("Both "+str(len(df))+" full = "+str(len(x_train))+" train + "+str(len(x_test))+" test")

ols = LinearRegression()
model_both = ols.fit(x_train, y_train)

both_pred = model_both.predict(x_test)
print("\nBoth R2 score: ",r2_score(y_test, both_pred))
print("Both MSE score: ",mean_squared_error(y_test, both_pred))

Both 898 full = 673 train + 225 test

Both R2 score:  0.014636787597684497
Both MSE score:  10.96364256629657


- Kleinerer $R^2$ Wert

In [61]:
#Create a Spark MLlib model for the same task!
SPARK_MASTER="local[1]"
#SPARK_MASTER="spark://mpp3r03c04s06.cos.lrz.de:7077"
APP_NAME = "PySpark Lecture Herget"
# If there is no SparkSession, create the environment
# os.environ["PYSPARK_PYTHON"] = "/naslx/projects/ug201/di57hah/anaconda2/envs/python3/bin/python"

try:
    sc and spark
except NameError as e:
  #import findspark
  #findspark.init()
    import pyspark
    import pyspark.sql
    from pyspark.ml.feature import FeatureHasher, VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    conf=pyspark.SparkConf().set("spark.cores.max", "4")
    sc = pyspark.SparkContext(master=SPARK_MASTER, conf=conf)
    spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()

print("PySpark initiated...")

PySpark initiated...


In [67]:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
# Build the model
model = LinearRegression.train(x_train, iterations=100, step=0.00000001)

# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds \
    .map(lambda vp: (vp[0] - vp[1])**2) \
    .reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))


AttributeError: type object 'LinearRegression' has no attribute 'train'

In [63]:
## Logisitic Regression
data_spark = spark.createDataFrame(df)

assembler = VectorAssembler(
    inputCols=["Father", "Mother", "Height"],
    outputCol="features")

result_spark = assembler.transform(data_spark)

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial", labelCol="Height")

model = lr.fit(result_spark)
print(model)

Py4JJavaError: An error occurred while calling o133.fit.
: org.apache.spark.SparkException: Classification labels should be in [0 to 79]. Found 298 invalid labels.
	at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:564)
	at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:488)
	at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:278)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:82)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
