![friends-tv-show-1542126105](https://user-images.githubusercontent.com/66208179/120125041-aa98ff80-c1bf-11eb-9e56-6b1990a1f720.jpeg)

Friends is one of my favorite TV shows ❤️. I remember crying about the finale for two days 😅 (it was so touching to see every character grow throughout the show ✨).

In this project, I will be getting a bit nostalgic and will try to understand the given data through Exploratory Data Analysis. I will also see if we can predict something (at this point, anything). It could be sentiment analysis based on summaries and titles, or classification of the number of episode contributors.

# Data

We have the following columns:
    
- `Date`: the date of release for the episode

- `Episode`: episode number (season - episode)

- `Title`: episode title

- `Directed by`

- `Written by`

- `Duration` 
- `Summary`

- `Rating/Share`

- `U.S. viewers`: how many people viewed the episode in the US

- `Prod. code`: unique values


# Process:

1. ✅Tweak the data a bit to understand the data in detail.
2. ✅Try to understand any type of correlation.
3. ✅If there is - model!

# Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Understanding Data

In [None]:
data = pd.read_csv("../input/friends-tv-show-all-seasons-and-episodes-data/friends_info.csv")
data.head()

In [None]:
data.info()

In [None]:
data.describe()

I will apply the following transformations:
    
- `Date`: split the date into year, month and day and add each as a new column

- `Episode`: keep the episode column, but also create two other columns for episode and season separately

- `Title`: keep it as it is for EDA, but since this is a categorical column we will need to transform it before modeling

- `Directed by`: keep it as it is for EDA, but since this is a categorical column we will need to transform it before modeling

- `Written by`: add a column `writtenby_number` to keep track of how many people wrote the episode

- `Duration`: keep it as it is

- `Summary`: keep it as it is for EDA, but since this is a free-text column we will need to transform it before modeling

- `Rating/Share`: divide the first number by the second and keep that value

- `U.S. viewers`: change to float

- `Prod. code`: can be removed

In [None]:
data.Episode

In [None]:
# Date column

data["release_year"] = pd.DatetimeIndex(data['Date']).year
data["release_month"] = pd.DatetimeIndex(data['Date']).month
data["release_day"] = pd.DatetimeIndex(data['Date']).day
data.drop("Date", axis = 1, inplace = True)

In [None]:
data.Episode

🧐 As we can see, there are some special rows that do not follow the **season-episode** format. So I will handle them separately and label them as `Special`, which keeps track of the different formats and effectively marks them as missing.

In [None]:
seasons = []
episodes = []

for i in data.Episode:
    # Regular episodes look like "season-episode"; anything else is a special
    parts = str(i).split("-")
    if len(parts) == 2:
        seasons.append(parts[0])
        episodes.append(parts[1])
    else:
        seasons.append("Special")
        episodes.append("Special")

data["episode_number"] = episodes
data["season_number"] = seasons
data["episode-season"] = data["Episode"]
data.drop("Episode", axis = 1, inplace = True)

🌱 The writer count will **depend on the number of `&` characters** in the `Written by` column, since that is the format the dataframe follows. If there is no `&`, we have one writer; if there is one `&`, there are two writers, and so on.

In [None]:
# written by column

written_number = [] 
for i in data["Written by"]:
    written_number.append(str(i).count('&') + 1)
    
data["writtenby_number"] = written_number
data["writtenby_number"].value_counts()
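
The same count can also be sketched without an explicit loop by using pandas string methods. The `writers` Series below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical values in the same "A & B" format as the `Written by` column
writers = pd.Series([
    "Writer A",
    "Writer A & Writer B",
    "Writer A & Writer B & Writer C",
])

# Number of writers = number of "&" separators + 1
writer_counts = writers.str.count("&") + 1
print(writer_counts.tolist())
```

One difference to keep in mind: missing values stay `NaN` with `.str.count`, whereas the loop above converts them to a string first and therefore counts them as one writer.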

In [None]:
# rating/ share column
rating_score = []
for i in data["Rating/Share"]:
    rating_score.append(float(i.split("/")[0]) / float(i.split("/")[1]))

data["rating"] = rating_score
data.drop("Rating/Share", axis = 1, inplace=True)
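
If the two numbers turn out to be useful on their own, a possible variation is to keep them as separate columns before dividing. This is only a sketch: the values and the column names `rating_points` and `share` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical values in the same "x/y" format as the `Rating/Share` column
rating_share = pd.Series(["20.2/29", "17.5/25"])

# Split on "/" into two columns and convert both to float
parts = rating_share.str.split("/", expand=True).astype(float)
parts.columns = ["rating_points", "share"]

# Same ratio as the loop above, computed column-wise
ratio = parts["rating_points"] / parts["share"]
print(parts.assign(ratio=ratio))
```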

In [None]:
# remove the prod column
data.drop("Prod.\ncode", axis = 1, inplace=True)

In [None]:
# US viewers change to float
# Strip the " million" suffix and convert the whole column in one step
data["U.S. viewers"] = (
    data["U.S. viewers"].str.replace(" million", "", regex=False).astype(float)
)


In [None]:
data["U.S. viewers"]

In [None]:
# rename some column with spaces

data.columns = data.columns.str.replace(' ', '_')
# regex=False so that '.' is treated as a literal dot, not as "any character"
data.columns = data.columns.str.replace('.', '', regex=False)

In [None]:
data.head()

# Data Visualization

🤯 At this point, we are done with tweaking the features (of course, we can always go back and iterate through different methods again). I will now **visualize** some of the columns to better understand the relations between them.

I am currently reading the **[Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)**, so I will try to apply what I've learned here. It is important to apply what you've recently learned, and it is one of the things I really love about Kaggle: you can apply what you've learned in books or courses here ⭐️

**Things that Caught My Eye During `Understanding Data` part:**
1. views vs. number of writers
2. views vs. date (month, day, year)
3. views and writers (who wrote it)
4. views and directors
5. us viewers and ratings


and whatever catches my attention during visualization :)



One of my favorite libraries is **Pandas Profiling**. You can check [this notebook](https://www.kaggle.com/dtomruk/commonlit-eda-modeling) to learn more about it.

In [None]:
# In newer environments this package is published as `ydata_profiling`
from pandas_profiling import ProfileReport
ProfileReport(data)

Feel free to discover what Pandas Profiling showed us. 

Here are some of my takes:

🌎**Correlation Matrix**: There isn't any correlation between numerical values.

<img width="663" alt="Screen Shot 2021-05-31 at 3 03 21 PM" src="https://user-images.githubusercontent.com/66208179/120232412-3496a600-c25c-11eb-9a81-4f086cf12f38.png">

🌈**The directors** that appeared the most:

<img width="410" alt="Screen Shot 2021-05-31 at 3 01 30 PM" src="https://user-images.githubusercontent.com/66208179/120232414-35c7d300-c25c-11eb-90f2-cf52ab800652.png">

👁The **views** in the US:

<img width="556" alt="Screen Shot 2021-05-31 at 3 02 08 PM" src="https://user-images.githubusercontent.com/66208179/120232415-36606980-c25c-11eb-97e5-4df4e05ca2e4.png">

In [None]:
# views vs ratings

sns.relplot(x="US_viewers", y="rating", data=data);

In [None]:
# Correlation is only defined for numeric columns, so select those first
sns.heatmap(data.select_dtypes("number").corr());

In [None]:
plt.hist(data.US_viewers);

I will also check if a character that is mentioned in the summary (such as Rachel or Monica) is somehow related to ratings or views.

In [None]:
rachel = []
monica = []
ross = []
chandler = []
joey = []
phoebe = []

for i in range(len(data)):
    # there are still nan values (possibly i could've filled them with "missing" value, too)
    try:
        if "Rachel" in data.Summary.iloc[i]:
            rachel.append((data.US_viewers.iloc[i], data.rating.iloc[i]))
        if "Monica" in data.Summary.iloc[i]:
            monica.append((data.US_viewers.iloc[i],data.rating.iloc[i]))
        if "Ross" in data.Summary.iloc[i]:
            ross.append((data.US_viewers.iloc[i], data.rating.iloc[i]))
        if "Chandler" in data.Summary.iloc[i]:
            chandler.append((data.US_viewers.iloc[i], data.rating.iloc[i]))
        if "Joey" in data.Summary.iloc[i]:
            joey.append((data.US_viewers.iloc[i], data.rating.iloc[i]))
        if "Phoebe" in data.Summary.iloc[i]:
            phoebe.append((data.US_viewers.iloc[i], data.rating.iloc[i]))
    except:
        pass

Check [*this tutorial*](https://www.tutorialspoint.com/matplotlib/matplotlib_pie_chart.htm#:~:text=Matplotlib%20API%20has%20a%20pie,array%20will%20not%20be%20normalized.) for pie charts.

In [None]:
# viewers
rachel_sum = sum([r[0] for r in rachel])
monica_sum = sum([r[0] for r in monica])
ross_sum = sum([r[0] for r in ross])
phoebe_sum = sum([r[0] for r in phoebe])
chandler_sum = sum([r[0] for r in chandler])
joey_sum = sum([r[0] for r in joey])



fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.axis('equal')
names = ['Rachel', 'Monica', 'Ross', 'Phoebe', 'Chandler', 'Joey']
list_sum = [rachel_sum, monica_sum, ross_sum, phoebe_sum, chandler_sum, joey_sum]
ax.pie(list_sum, labels = names,autopct='%1.2f%%')
plt.show()

There doesn't seem to be a huge difference when it comes to viewers.

In [None]:
# ratings

rachel_sum = sum([r[1] for r in rachel])
monica_sum = sum([r[1] for r in monica])
ross_sum = sum([r[1] for r in ross])
phoebe_sum = sum([r[1] for r in phoebe])
chandler_sum = sum([r[1] for r in chandler])
joey_sum = sum([r[1] for r in joey])



fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.axis('equal')
names = ['Rachel', 'Monica', 'Ross', 'Phoebe', 'Chandler', 'Joey']
list_sum = [rachel_sum, monica_sum, ross_sum, phoebe_sum, chandler_sum, joey_sum]
ax.pie(list_sum, labels = names,autopct='%1.2f%%')
plt.show()

Not a huge difference on ratings either. Rachel seems to attract more viewers and ratings though! 🧚

The 👑queen👑  deserves her own column then :)

In [None]:
x = []

for i in data.Summary:
    try:
        x.append(i.count("Rachel"))
    except:
        x.append(0)
        
data["rachel_count"] = x
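
One caveat about the pie charts above: they compare sums, so a character who is mentioned in more summaries gets a bigger slice even if their episodes are not watched more. A minimal sketch of comparing per-episode means instead is shown below. The toy frame is invented for illustration and only reuses the column names from this notebook.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    "Summary": ["Rachel moves in.", "Ross and Rachel argue.", "Joey gets a job.", None],
    "US_viewers": [20.0, 30.0, 25.0, 22.0],
})

names = ["Rachel", "Ross", "Joey"]
mean_viewers = {}
for name in names:
    # na=False treats a missing summary as "character not mentioned"
    mentioned = toy["Summary"].str.contains(name, na=False)
    mean_viewers[name] = toy.loc[mentioned, "US_viewers"].mean()

print(mean_viewers)
```

With means, a character only stands out if the episodes that mention them really draw more viewers on average.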

# Define the Problem

## Can we predict the ratings of episodes based on other columns?

The approach will be based on time series. Our model can learn from previous episodes and predict the rating of future episodes. This is a regression problem since the target is the rating, a continuous value.

Before getting into modeling, we need to do several things:
- categorical to numerical transformations
- train test split 
- fill missing values
- scaling if needed
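
For the last bullet, a minimal sketch of standardization could look like the following. It assumes the scaling parameters are estimated on the training rows only and then reused on the test rows. The numbers are made up.

```python
import pandas as pd

# Hypothetical numeric feature, already ordered by release date
feature = pd.Series([22.0, 23.0, 30.0, 22.0, 24.0])
train_part, test_part = feature[:4], feature[4:]

# Estimate mean and standard deviation on the training rows only
mean, std = train_part.mean(), train_part.std()

train_scaled = (train_part - mean) / std
test_scaled = (test_part - mean) / std  # reuse the training mean and std

print(train_scaled.round(3).tolist())
print(test_scaled.round(3).tolist())
```

Whether scaling is needed at all depends on the model: tree-based models usually do not care, while linear or distance-based models often do.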

In [None]:
data.shape

## Handling Categorical Data

In [None]:
data.info()

A few things first:
- `episode_number`, `season_number` and `US_viewers` can be expressed as numeric values. Let's fix that.
- We can drop `episode-season` since we already have those in separate columns.

In [None]:
data.drop("episode-season", axis = 1, inplace = True)

In [None]:
# Specials (and anything else that is not a plain number) become 0
cleaned = data["episode_number"].astype(str).str.strip()
data["episode_number"] = pd.to_numeric(cleaned, errors="coerce").fillna(0).astype(int)


In [None]:
# Specials (and anything else that is not a plain number) become 0
cleaned = data["season_number"].astype(str).str.strip()
data["season_number"] = pd.to_numeric(cleaned, errors="coerce").fillna(0).astype(int)


In [None]:
data.drop("Summary", axis = 1, inplace = True)

In [None]:
data.season_number = pd.to_numeric(data.season_number)
data.episode_number = pd.to_numeric(data.episode_number)
data.US_viewers = pd.to_numeric(data.US_viewers)

In [None]:
# Everything that is not numeric is treated as categorical
cat_cols = data.select_dtypes(exclude="number").columns.tolist()
cat_cols

In [None]:
dummies = pd.get_dummies(data[cat_cols])
data = data.drop(cat_cols, axis = 1)
data = pd.concat([data, dummies], axis = 1)
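
Before one-hot encoding, it can help to check how many distinct values each categorical column has. A column where every row is different (episode titles are a likely candidate) produces one dummy column per row and carries no reusable signal. The sketch below uses invented values and is only one possible way to filter such columns.

```python
import pandas as pd

# Hypothetical toy data: titles are unique, directors repeat
toy = pd.DataFrame({
    "Title": ["Title A", "Title B", "Title C", "Title D"],
    "Directed_by": ["Dir X", "Dir X", "Dir Y", "Dir X"],
})

# Share of distinct values per column: 1.0 means every row is different
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio)

# Encode only the columns whose values actually repeat
to_encode = unique_ratio[unique_ratio < 1.0].index.tolist()
encoded = pd.get_dummies(toy[to_encode])
print(encoded.columns.tolist())
```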

In [None]:
data.head()

## Train Test Split

It is important to split our data before filling the missing values to prevent data snooping.

I will keep the test size at 0.2.

In [None]:
len(data) * 0.8

In [None]:
# note that the data is already in ascending order in terms of release date
# .copy() gives independent frames, so later fills do not touch `data`
train = data[:183].copy()
test = data[183:].copy()
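
Instead of hard-coding the cut-off, the split position can be derived from the length of the frame. This is a small sketch on a toy frame. The names `toy`, `train_toy` and `test_toy` are placeholders.

```python
import pandas as pd

# Hypothetical frame that is already sorted by release date
toy = pd.DataFrame({"rating": range(10)})

# The first 80% of the rows go to training, the rest to testing
split = int(len(toy) * 0.8)
train_toy = toy.iloc[:split].copy()
test_toy = toy.iloc[split:].copy()

print(len(train_toy), len(test_toy))
```

Because the rows are not shuffled, every test row comes after every training row, which matches the time-series framing above.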

## Fill Missing Values

### Train

In [None]:
train.isna().sum()

In [None]:
train["Duration"] = train["Duration"].fillna(train["Duration"].mean())

### Test

In [None]:
print(test.isna().sum().sum())
# Fill with the training mean so that no test-set statistics are used
test["Duration"] = test["Duration"].fillna(train["Duration"].mean())

In [None]:
X_train, y_train = train.drop("rating", axis = 1), train["rating"]
X_test, y_test = test.drop("rating", axis = 1), test["rating"]
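
Before fitting any model, it is useful to have a naive baseline to beat. One simple option, sketched here with made-up numbers, is to predict the mean training rating for every test episode and measure the mean absolute error. The variable names are placeholders and not part of the notebook.

```python
import numpy as np

# Hypothetical ratings in chronological order
y_train_toy = np.array([0.70, 0.72, 0.68, 0.74])
y_test_toy = np.array([0.71, 0.69])

# Baseline: always predict the average rating seen during training
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())

# Mean absolute error of the baseline on the test ratings
mae = np.mean(np.abs(y_test_toy - baseline_pred))
print(mae)
```

Any regression model trained on `X_train` and `y_train` should reach a lower error than this to be worth keeping.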