`FirstMilestone.py` gave us 6000 films as `movies_data.csv`. <br>
We need to preprocess this data before it can be used further.

### Part-1 creating 'adult' feature
To avoid recommending adult films to someone searching for "Finding Nemo" or other non-adult films, <br>
we created an 'adult' feature from the ```"keywords", "description"``` columns. <br>

**Steps:**
1. Loaded the `movies_data.csv` dataset
2. Defined `adult_keywords`, a list of words that helps identify adult movies
3. Created a method that returns 1 (adult) if any of the `adult_keywords` is found in the text, and 0 (non-adult) otherwise
4. Applied it to ```"keywords", "description"``` and combined the two results into a single 0 or 1 value
5. Saved the result in `adultAdded.csv`

In [None]:
import pandas as pd

# 1.Load the dataset
movies_file = 'movies_data.csv'
data = pd.read_csv(movies_file)

# 2. creating list of adult words
adult_keywords = [
    "sex", "sexual", "lgbtq", "prostitute", "erotic", "nudity", "nude", "porn", "xxx", "explicit", 
    "sensual", "intimate", "seduction", "fetish", "kinky", "provocative", "raunchy", "sultry", "taboo", "voyeur",
    "bdsm","stripper", "escort", "affair", "orgy", "brothel", 
    "lust", "vulgar", "softcore", "hardcore", "molestation", "incest", "rape", 
    "pedophile", "perversion", "depraved", "lewd", 
    "pornographic", "adult movie", "taboo themes", 
    "indecent", "exposed", "uncensored", "provocation", "seductive", "temptation", 
    "flirtatious", "voyeurism", "depravity", "lecherous", "forbidden", "sensational", 
    "obscenities", "bare", "exposed body", "indulgent"
]
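# 3. method to check for adult words
def check_adult_content(text):
    # Missing text cannot contain a keyword, so treat it as non-adult
    if pd.isnull(text):
        return 0
    text = text.lower()
    # Note: this is a substring match, so "bare" also matches "barely"
    for keyword in adult_keywords:
        if keyword in text:
            return 1
    return 0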


# 4. Create 'adult': 1 if 'description' OR 'keywords' contains an adult keyword
#    (bitwise OR of the two 0/1 columns, e.g. 1 | 0 = 1)
data['adult'] = data['description'].apply(check_adult_content) | data['keywords'].apply(check_adult_content)
# Make sure the result is stored as int (0 or 1)
data['adult'] = data['adult'].astype(int)

# 5.save to csv
data.to_csv('adultAdded.csv', index=False)

print("IT WORKS, file saved here --> 'adultAdded.csv'.")

IT WORKS, file saved here --> 'adultAdded.csv'.
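
`check_adult_content` uses a plain substring test, so a short keyword such as "bare" or "sex" also fires on unrelated words like "barely" or "Essex". One possible refinement, shown here only as a sketch, is to match whole words with a regular expression. The function name `contains_adult_keyword` and the shortened keyword list are made up for this illustration and are not part of the notebook.

```python
import re

# Shortened, illustrative keyword list (the notebook's full list would be used instead)
sample_keywords = ["sex", "bare", "erotic", "adult movie"]

# \b marks a word boundary, so "bare" no longer matches inside "barely"
adult_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in sample_keywords) + r")\b",
    flags=re.IGNORECASE,
)

def contains_adult_keyword(text):
    """Return 1 if a keyword appears as a whole word or phrase, else 0."""
    if not isinstance(text, str):  # covers None and NaN
        return 0
    return 1 if adult_pattern.search(text) else 0

def substring_check(text):
    """Same logic as the notebook's substring test, for comparison."""
    lowered = text.lower()
    return 1 if any(k in lowered for k in sample_keywords) else 0

harmless = "A barely known town in Essex"
print(substring_check(harmless), contains_adult_keyword(harmless))
print(contains_adult_keyword("An erotic thriller"))
```

The trade-off is that whole-word matching misses inflected forms (for example a plural of a keyword), so the keyword list would need to include them explicitly.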


### Part-2 further preprocessing

**Steps:**
1. Loaded `"adultAdded.csv"`
2. Selected the features that we will use: ```'title', 'genres', 'user_rating', 'keywords', 'director', 'adult'```
3. Dropped movies that don't have a `title` or `user_rating`, because both are essential for our model
4. Filled *null* values in `'genres', 'keywords', 'director'` with "Unknown" and in `adult` with 0
5. `user_rating` is originally on a 0 to 10 scale; we normalized it with `MinMaxScaler` so that it lies in the range [0, 1]
6. One-hot encoded `genres`. This takes a few substeps to explain:<br>
---
6.1 Split each genre string into a list of genres, e.g. 'Action, Adventure' -> ['Action', 'Adventure']:
```python
data['genres'] = data['genres'].str.split(', ')
```
6.2 Turn the lists into one 0/1 column per genre:
```python
genres_one_hot = data['genres'].explode().str.get_dummies().groupby(level=0).sum()
```
Breaking 6.2 down step by step:
- **Input** (`data['genres']`):
  ```
  0    ['Action', 'Adventure']
  1    ['Drama']
  2    ['Action', 'Comedy']
  ```

- (after `explode()`):
  ```
  0    Action
  0    Adventure
  1    Drama
  2    Action
  2    Comedy
  ```

- (after `.str.get_dummies()`):
  ```
       Action  Adventure  Comedy  Drama
  0         1          0       0      0
  0         0          1       0      0
  1         0          0       0      1
  2         1          0       0      0
  2         0          0       1      0
  ```

- **Output** (after `.groupby(level=0).sum()`):
  ```
       Action  Adventure  Comedy  Drama
  0         1          1       0      0
  1         0          0       0      1
  2         1          0       1      0
  ```

6.3 `data = pd.concat([data, genres_one_hot], axis=1)`<br>
Appends the newly created one-hot genre columns to our dataframe.
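
Substeps 6.1 to 6.3 can be reproduced on the three example rows from the breakdown above. This is a minimal sketch: the `toy` frame is made up for illustration and is not read from `adultAdded.csv`. It also builds the same table with `Series.str.get_dummies(sep=', ')`, which should be equivalent in one call as long as every genre string uses exactly `', '` as its separator (an assumption about the data).

```python
import pandas as pd

# Hypothetical toy data mirroring the example rows above
toy = pd.DataFrame({'genres': ['Action, Adventure', 'Drama', 'Action, Comedy']})

# 6.1: 'Action, Adventure' -> ['Action', 'Adventure']
split_genres = toy['genres'].str.split(', ')

# 6.2: one row per genre, one 0/1 column per genre, then sum back per movie
genres_one_hot = split_genres.explode().str.get_dummies().groupby(level=0).sum()

# 6.3: attach the one-hot columns to the original frame
toy_encoded = pd.concat([toy, genres_one_hot], axis=1)
print(toy_encoded)

# Alternative: let get_dummies split on ', ' directly
one_call = toy['genres'].str.get_dummies(sep=', ')
print(one_call)
```

Because `explode()` keeps the original row index, `groupby(level=0)` can collapse the per-genre rows back to one row per movie, and `pd.concat` then aligns them with the right titles.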

---
7. Removed rows with a duplicate `title`, keeping the first occurrence
8. Finally, saved the result in `"cleaned_movies.csv"`


In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# 1.loading our dataset
file_path = "adultAdded.csv"  
movies = pd.read_csv(file_path)

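# 2. selecting features we need
columns_to_use = ['title', 'genres', 'user_rating', 'keywords', 'director', 'adult']
# .copy() gives an independent frame, so the assignments below do not
# trigger pandas' SettingWithCopyWarning
data = movies[columns_to_use].copy()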


# 3. dropping null rows
data = data.dropna(subset=['title', 'user_rating'])

# 4. Filling null values
data['genres'] = data['genres'].fillna('Unknown')
data['keywords'] = data['keywords'].fillna('Unknown')
data['director'] = data['director'].fillna('Unknown')
data['adult'] = data['adult'].fillna(0)  # Filling nulls with 0(non-adult movie)

# 5. normalizing user_rating to [0, 1] (min-max scaling over the observed ratings)
scaler = MinMaxScaler()
data['user_rating'] = scaler.fit_transform(data[['user_rating']])

# 6. Onehot encoding genres
data['genres'] = data['genres'].str.split(', ')
genres_one_hot = data['genres'].explode().str.get_dummies().groupby(level=0).sum()
data = pd.concat([data, genres_one_hot], axis=1)

# 7. dropping duplicates
data = data.drop_duplicates(subset='title', keep='first')

# 8.Save cleaned dataset
output_file = "cleaned_movies.csv"
data.to_csv(output_file, index=False)

print(f"Cleaned dataset saved to {output_file}")


Cleaned dataset saved to cleaned_movies.csv
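
A note on step 5: with its default settings, `MinMaxScaler` rescales by the smallest and largest rating actually present in the data, not by the nominal 0 to 10 scale. The sketch below uses plain pandas and made-up ratings to show the difference between the two choices; `(x - min) / (max - min)` is the formula min-max scaling applies per column.

```python
import pandas as pd

# Hypothetical ratings on the 0-10 scale (not taken from the dataset)
ratings = pd.Series([5.0, 7.5, 9.0])

# Min-max scaling over the observed values, as MinMaxScaler does by default
observed_scaled = (ratings - ratings.min()) / (ratings.max() - ratings.min())

# Scaling by the nominal range of the rating scale instead
nominal_scaled = ratings / 10

print(observed_scaled.tolist())
print(nominal_scaled.tolist())
```

With observed min-max scaling the lowest-rated film in the dataset always maps to 0 and the highest to 1; dividing by 10 keeps the original spacing relative to the full scale. Which one is preferable depends on how the model uses `user_rating`.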
