# Amazon reviews data transformation
> **Before you start**: 
>
> Download this file, uncompress it and place the files `train.csv` and `test.csv` at the root of the `MachineLearning-SageMaker-Challenge` directory. 
>
> **[m.serverless.link/now](https://m.serverless.link/now)**


This notebook helps you transform the original [Amazon Reviews](https://course.fast.ai/datasets) dataset into the BlazingText format, for example:

```
__label__YOUR_LABEL_NAME I've used spincast reels for over 40 years...
```

Here `YOUR_LABEL_NAME` is the label you want to apply to that review, e.g. `positive`, `negative`, `neutral`, or `other`.
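As a minimal sketch of that line format, the helper below builds one BlazingText line from a label and a review. The name `toBlazingTextLine` is hypothetical and not part of this notebook, and collapsing whitespace is an assumption: it only matters if a review could contain line breaks, since each example must stay on a single line.

```scala
// Hypothetical helper (not part of the notebook): builds one BlazingText line.
def toBlazingTextLine(label: String, text: String): String = {
  // Collapse newlines, tabs and repeated spaces so the review stays on one line.
  val singleLine = text.replaceAll("\\s+", " ").trim
  s"__label__$label $singleLine"
}

val example = toBlazingTextLine("positive", "I've used spincast reels\nfor over 40 years...")
println(example) // prints: __label__positive I've used spincast reels for over 40 years...
```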

In [12]:
val sourceName = "train"
val dfr = spark.read
  .option("escape", "\"")
  .format("csv")
  .load(s"/home/jovyan/$sourceName.csv")

sourceName = train
dfr = [_c0: string, _c1: string ... 1 more field]


[_c0: string, _c1: string ... 1 more field]

### Optional
The following block displays information about the data that was loaded:

In [13]:
dfr.printSchema()
dfr.show()
dfr.count()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)

+---+--------------------+--------------------+
|_c0|                 _c1|                 _c2|
+---+--------------------+--------------------+
|  3|  more like funchuck|Gave this to my d...|
|  5|           Inspiring|I hope a lot of p...|
|  5|The best soundtra...|I'm reading a lot...|
|  4|    Chrono Cross OST|The music of Yasu...|
|  5| Too good to be true|Probably the grea...|
|  5|There's a reason ...|There's a reason ...|
|  1|        Buyer beware|This is a self-pu...|
|  4|Errors, but great...|I was a dissapoin...|
|  1|          The Worst!|A complete waste ...|
|  1|           Oh please|I guess you have ...|
|  1|Awful beyond belief!|I feel I have to ...|
|  4|A romantic zen ba...|When you hear fol...|
|  5|Lower leg comfort...|Excellent stockin...|
|  3|Delivery was very...|It took almost 3 ...|
|  2|sizes recomended ...|sizes are much sm...|
|  3|            Overbury

3000000

## A temporary class to hold the review data

In [15]:
case class Sentiment(text: String, sentiment: String) {
    // Plain accessor read by the filter step; it never changes, so a val is enough.
    val _sentiment = sentiment
    
    override def toString(): String = {
        s"__label__$sentiment $text"
    }
}

defined class Sentiment


## TODO: 
### Transform each row of the data to `Sentiment` and clean up using the `filter` method

In [16]:
val dfr2 = dfr.map(r => {
    val stars: String = r.getString(0)
    val title: String = r.getString(1) // You may want to do something with this?
    val text: String = r.getString(2) 
    
    // In Scala, the last expression of a block is its value, so the if/else
    // below is what the function returns: one Sentiment object per row.
    if(stars == "1" || stars == "2") {
        Sentiment(text, "negative")
    } else if(stars == "4" || stars == "5") {
        Sentiment(text, "positive")
    } else if(stars == "3") {
        Sentiment(text, "neutral")
    } else {
        Sentiment(text, "null")
    }
})
.filter(r => r._sentiment != "null") // <===== TODO: REMOVE ANY DATA YOU MAY NOT WANT IN YOUR FINAL RESULTS

dfr2 = [text: string, sentiment: string]


[text: string, sentiment: string]
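One way to approach the TODO, sketched in plain Scala so it can be tried without Spark. The helpers `labelFor` and `reviewText` are hypothetical names, not part of this notebook, and the sample rows are made up. The sketch assumes you want to prepend the title to the review body and to drop rows without a usable rating; returning an `Option` replaces the `"null"` marker string used above.

```scala
// Hypothetical helpers; the names are illustrative and not part of the notebook.

// Maps a star rating to a label; None marks a row that should be dropped.
def labelFor(stars: String): Option[String] = stars match {
  case "1" | "2" => Some("negative")
  case "3"       => Some("neutral")
  case "4" | "5" => Some("positive")
  case _         => None // also covers a null rating
}

// Joins title and body, skipping null or empty parts (all columns are nullable).
def reviewText(title: String, text: String): String =
  Seq(title, text).flatMap(Option(_)).map(_.trim).filter(_.nonEmpty).mkString(". ")

// Made-up sample rows in (stars, title, text) order.
val rows = Seq(
  ("5", "Great reel", "Works well."),
  ("x", null, "No usable rating")
)

// Rows whose rating has no label disappear because flatMap drops None.
val lines = rows.flatMap { case (stars, title, text) =>
  labelFor(stars).map(label => s"__label__$label ${reviewText(title, text)}")
}
lines.foreach(println) // prints: __label__positive Great reel. Works well.
```

Inside the `map` above, the same functions could be called with the values read from `r.getString(...)`.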

### Optional
Check the total number of records that remain after filtering:

In [17]:
dfr2.count()

3000000

## Save the transformed data

In [18]:
dfr2
    .map(r => r.toString())
    .coalesce(1)
    .write.text(s"/home/jovyan/transformed-$sourceName")

## Rename and move file
The output is written to the `transformed-SOURCE_NAME` directory. Rename the file to `train.txt` or `test.txt` and upload it to the MLData bucket created in **Level 00**.
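The rename can also be done from a notebook cell. This is a rough sketch, assuming the output directory contains exactly one data file whose name starts with `part-` (which is what `coalesce(1)` typically produces, next to a `_SUCCESS` marker). The helper `movePartFile` is a hypothetical name, not part of this notebook.

```scala
import java.nio.file.{Files, Path, Paths, StandardCopyOption}

// Hypothetical helper: moves the single part file found in `outputDir` to `target`.
def movePartFile(outputDir: Path, target: Path): Path = {
  // listFiles() returns null when the directory is missing, hence the Option.
  val parts = Option(outputDir.toFile.listFiles())
    .getOrElse(Array.empty[java.io.File])
    .filter(_.getName.startsWith("part-")) // skips _SUCCESS and hidden .crc files
  require(parts.length == 1, s"expected exactly one part file, found ${parts.length}")
  Files.move(parts.head.toPath, target, StandardCopyOption.REPLACE_EXISTING)
}

// Example call for the train split (paths follow the cells above):
// movePartFile(Paths.get("/home/jovyan/transformed-train"), Paths.get("/home/jovyan/train.txt"))
```

The upload to the bucket is still a separate step.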

## Re-run the process
**This time**, change the value of the variable `sourceName` from `train` to `test`.