# Logistic Regression with PySpark on Windows

# Beginning
* Open `PuTTY`
* Enter your Cloudera IP address in the *Host Name (or IP address)* field
* Leave all other settings at their defaults
* Click **Open**

<img src="ss/1 PuttyConf.PNG" height="500" width="300">

* Enter your Cloudera username
* Enter your password

<img src="ss/2 login.PNG">

* Load the environment settings by sourcing `/tmp/source_profile`

**CODE** : <br>
>```bash
source /tmp/source_profile
```

<img src="ss/3 sourcee.PNG">

# Start Playing in PYSPARK

* Start the **PySpark** shell

**CODE** : <br>
>```bash
pyspark2
```

<img src="ss/4 pyspark2.PNG">

* ### Import Library

**CODE** : <br>
>```python
from pyspark.sql import SQLContext
```

<img src="ss/5 import  lib.PNG">

* ###  Load the Dataset

**CODE** : <br>
>```python
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('/user/cloudera/clean_tweet.csv')
```


<img src="ss/6 load dataset.PNG">

* ### Show the First Rows of the Dataset

**CODE** : <br>
>```python
df.show(5)
```

<img src="ss/7 show.PNG">

* ### Drop Missing Values

**CODE** : <br>
>```python
df = df.dropna()
df.count()
```

<img src="ss/8 missing value.PNG">

* ### Split the Dataset into Train & Test Sets

**CODE** : <br>
>```python
(train_set, test_set) = df.randomSplit([0.8, 0.2], seed = 2000)
```

<img src="ss/9 train test split.PNG">
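
`randomSplit` assigns each row to a split at random according to the given weights, so the resulting sizes are only approximately 80/20, while the `seed` makes the assignment repeatable. The idea can be sketched in plain Python as follows. This is an illustration only, not Spark's implementation, and `random_split` is a hypothetical helper name.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing one uniform number per row."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift
    rng = random.Random(seed)  # same seed -> same assignment
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
    return splits

rows = list(range(10000))
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=2000)
print(len(train_rows), len(test_rows))
```

Because every row is sampled independently, the two counts add up to the total but rarely land on exactly 8000 and 2000.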

***
# ------------------------- HashingTF + IDF + Logistic Regression ------------------------
***

* ### Import Library 

**CODE** : <br>
>```python
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
```

<img src="ss/10 import lib for HashingTF + IDF + Logistic Regression.PNG">

* ### Set Up the TF-IDF Vectorizer

**CODE** : <br>
>```python
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashtf = HashingTF(numFeatures=2**16, inputCol="words", outputCol='tf')
idf = IDF(inputCol='tf', outputCol="features", minDocFreq=5)  # minDocFreq=5: terms seen in fewer than 5 documents get an IDF of 0
label_stringIdx = StringIndexer(inputCol = "target", outputCol = "label")
pipeline = Pipeline(stages=[tokenizer, hashtf, idf, label_stringIdx])
```

<img src="ss/11 definition new column for HashingTF + IDF + Logistic Regression.PNG">
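
To make the two feature stages less of a black box, here is a rough pure-Python sketch of what they compute: the hashing trick maps each word to one of `numFeatures` buckets and counts occurrences, and IDF then weights each bucket by `log((N + 1) / (df + 1))`, where `N` is the number of documents and `df` the number of documents containing that bucket. The helper names and the tiny corpus are made up for illustration, and `zlib.crc32` is only a stand-in for the hash function Spark uses.

```python
import math
import zlib

def hashing_tf(words, num_features):
    """Map each word to a bucket index and count occurrences per bucket."""
    counts = {}
    for word in words:
        # Assumption: crc32 stands in for Spark's own hash function.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0) + 1
    return counts

def fit_idf(tf_docs, min_doc_freq):
    """Compute log((N + 1) / (df + 1)) per bucket; rare buckets get 0."""
    n_docs = len(tf_docs)
    doc_freq = {}
    for tf in tf_docs:
        for index in tf:
            doc_freq[index] = doc_freq.get(index, 0) + 1
    idf = {}
    for index, df in doc_freq.items():
        if df >= min_doc_freq:
            idf[index] = math.log((n_docs + 1.0) / (df + 1.0))
        else:
            idf[index] = 0.0  # filtered out, like minDocFreq
    return idf

# Hypothetical toy corpus, already tokenized
docs = [
    "spark makes big data simple".split(),
    "big data needs big tools".split(),
    "logistic regression on spark".split(),
]
tf_docs = [hashing_tf(d, 2 ** 16) for d in docs]
idf = fit_idf(tf_docs, min_doc_freq=2)
tfidf_docs = [{i: c * idf[i] for i, c in tf.items()} for tf in tf_docs]
print(tfidf_docs[1])
```

Different words can land in the same bucket, which is the price of not having to build a vocabulary first.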

**CODE** : <br>
>```python
pipelineFit = pipeline.fit(train_set)
train_df = pipelineFit.transform(train_set)
test_df = pipelineFit.transform(test_set)
train_df.show(5)
```

<img src="ss/12 Gabung new column for HashingTF + IDF + Logistic Regression.PNG">

* ### The Logistic Regression 

**CODE** : <br>
>```python
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=100)
lrModel = lr.fit(train_df)
```

<img src="ss/13 logreg fit.PNG">

* ### Predict on the Test Data

**CODE** : <br>
>```python
predictions = lrModel.transform(test_df)
```

<img src="ss/14 lr predict.PNG">
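
For a binary model, `transform` adds `rawPrediction`, `probability` and `prediction` columns. The sketch below shows, in plain Python, how those three values relate for a single row, assuming a 0.5 threshold. The weights and features are invented numbers and `predict_row` is a hypothetical helper, not part of PySpark.

```python
import math

def predict_row(features, weights, intercept, threshold=0.5):
    """Return (rawPrediction, probability, prediction) for one row."""
    # Linear score of the row under the fitted coefficients
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    # Sigmoid turns the score into the probability of label 1
    p1 = 1.0 / (1.0 + math.exp(-margin))
    raw_prediction = [-margin, margin]
    probability = [1.0 - p1, p1]
    prediction = 1.0 if p1 > threshold else 0.0
    return raw_prediction, probability, prediction

# Hypothetical coefficients and feature values
raw, prob, pred = predict_row([1.0, 0.0], [2.0, -1.0], 0.0)
print(raw, prob, pred)
```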

* ### Model Evaluation

**CODE** : <br>
>```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)
```

<img src="ss/15 evaluate pred lr.PNG">

**CODE** : <br>
>```python
accuracy = predictions.filter(predictions.label == predictions.prediction).count() / float(test_set.count())
accuracy
```

<img src="ss/16 accuracy Lr 1.PNG">
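
The two numbers above measure different things: `BinaryClassificationEvaluator` reports the area under the ROC curve by default, which depends on how well the scores rank positives above negatives, while accuracy only looks at the final 0/1 predictions. A small pure-Python sketch on made-up labels and scores shows both. The pairwise AUC below is a quadratic-time illustration of the metric's meaning, not the way Spark computes it.

```python
def accuracy_score(labels, predictions):
    """Fraction of rows whose predicted label equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / float(len(labels))

def roc_auc(labels, scores):
    """Chance that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7]
predicted = [1 if s > 0.5 else 0 for s in scores]

acc = accuracy_score(labels, predicted)
auc = roc_auc(labels, scores)
print("Accuracy: {0:.4f}  ROC-AUC: {1:.4f}".format(acc, auc))
```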

***
# ----------------------- CountVectorizer + IDF + Logistic Regression ----------------------
***

* ### Import Library 

**CODE** : <br>
>```python
from pyspark.ml.feature import CountVectorizer
```

<img src="ss/17 library CountVectorizer + IDF + Logistic Regression.PNG">

* ### Set Up the Count Vectorizer & Logistic Regression

**CODE** : <br>
>```python
tokenizer = Tokenizer(inputCol="text", outputCol="words")
cv = CountVectorizer(vocabSize=2**16, inputCol="words", outputCol='cv')
idf = IDF(inputCol='cv', outputCol="features", minDocFreq=5)  # minDocFreq=5: terms seen in fewer than 5 documents get an IDF of 0
label_stringIdx = StringIndexer(inputCol = "target", outputCol = "label")
lr = LogisticRegression(maxIter=100)
pipeline = Pipeline(stages=[tokenizer, cv, idf, label_stringIdx, lr])
```

<img src="ss/18 definition new variable for CountVectorizer + IDF + Logistic Regression.PNG">
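
Unlike the hashing approach, `CountVectorizer` first learns an explicit vocabulary (the most frequent terms, up to `vocabSize`) and then counts only those terms, so there are no hash collisions but fitting needs a pass over the corpus. A minimal pure-Python sketch of that idea follows. The helper names, the toy corpus and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter

def fit_vocabulary(token_docs, vocab_size):
    """Keep the vocab_size most frequent terms and give each an index."""
    totals = Counter()
    for tokens in token_docs:
        totals.update(tokens)
    # Sort by descending count; ties broken alphabetically (an assumption)
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: i for i, (term, _) in enumerate(ranked[:vocab_size])}

def count_vector(tokens, vocabulary):
    """Sparse count vector: {term index: count}, ignoring unknown terms."""
    counts = Counter(t for t in tokens if t in vocabulary)
    return {vocabulary[t]: c for t, c in counts.items()}

# Hypothetical toy corpus, already tokenized
docs = [
    "spark makes big data simple".split(),
    "big data needs big tools".split(),
    "logistic regression on spark".split(),
]
vocabulary = fit_vocabulary(docs, vocab_size=3)
vector = count_vector(docs[1], vocabulary)
print(vocabulary, vector)
```

With `vocab_size=3`, only the three most frequent terms survive, so words such as "needs" and "tools" are dropped from the vector.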

**CODE** : <br>
>```python
pipelineFit = pipeline.fit(train_set)
```

<img src="ss/19 pipeline fit.PNG">

* ### Predict on the Test Data

**CODE** : <br>
>```python
predictions = pipelineFit.transform(test_set)
```

<img src="ss/20 prediction.PNG">

* ### Model Evaluation

**CODE** : <br>
>```python
accuracy = predictions.filter(predictions.label == predictions.prediction).count() / float(test_set.count())
print("Accuracy Score: {0:.4f}".format(accuracy))
```

<img src="ss/21 accuracy.PNG">

**CODE** : <br>
>```python
roc_auc = evaluator.evaluate(predictions)
print("ROC-AUC: {0:.4f}".format(roc_auc))
```

<img src="ss/22 ROC AUC.PNG">