# THE ONE we love the most 👇😍
![Poster](https://www.gbposters.com/media/catalog/product/cache/1/image/9df78eab33525d08d6e5fb8d27136e95/f/r/friends-milkshake-maxi-poster-1.16.jpg)

In [None]:
import calendar
import pandas as pd
import plotly.express as px
from pathlib import Path
from IPython.display import display
import plotly.io as pio
import numpy as np
from collections import Counter

pio.templates.default = "ggplot2"

In [None]:
FILEPATH = Path("../input/friends-tv-show-all-seasons-and-episodes-data/friends_info.csv")

# 💾 Load Data 

In [None]:
# Load data
df = pd.read_csv(FILEPATH)
print(df.shape)
display(df.sample(6))

# 🧐 Data Cleaning

In [None]:
# Clean Episodes column
df['Season-Episode'] = df['Episode']  # keep the original label for the hover fields
cleaned_column = []
for val in df.Episode:
    if '\n' in val:
        # Multi-line entry: append the second line's episode number to the first line
        val1, val2 = val.split('\n')
        val = f"{val1}{val2.split('-')[1]}"
    cleaned_column.append(val)
df.Episode = cleaned_column

# Separate Month, Day, and Year from Date column
timedf = pd.DataFrame(
    data=list(map(lambda x: list(map(int, x.split("/"))), df.Date)), 
    columns=["Month", "Day", "Year"]
)
# df.Date = pd.to_datetime(df.Date) # convert string to date format

# Make new columns with season number and episode number
df.Episode = df.Episode.replace({"Special": "10-100"}) # Give the special a placeholder "season-episode" label so it splits like the rest
sedf = pd.DataFrame(list(map(lambda x: x.split("-"), df.Episode)), columns=["Season", "Episode"])
df = df.drop('Episode', axis=1)
df = pd.concat((timedf, df, sedf), axis=1)

# Remove million from the count
df["U.S. viewers"] = df["U.S. viewers"].apply(lambda x: float(x.replace(' million', '')))
df = df.rename(columns={"U.S. viewers": "U.S. viewers (in million)"})
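
As an alternative to splitting the date strings by hand, pandas can parse them directly. The sketch below runs on a tiny made-up frame (the rows are illustrative, not taken from the dataset) and assumes dates in `M/D/YYYY` form and viewer counts ending in `" million"`, as the cleaning code above implies.

```python
import pandas as pd

# Hypothetical sample rows that mimic the raw column formats
sample = pd.DataFrame({
    "Date": ["9/22/1994", "1/5/2000"],
    "U.S. viewers": ["21.5 million", "12.3 million"],
})

# Parse the dates once, then read the parts from the .dt accessor
parsed = pd.to_datetime(sample["Date"], format="%m/%d/%Y")
sample["Month"] = parsed.dt.month
sample["Day"] = parsed.dt.day
sample["Year"] = parsed.dt.year

# Strip the unit and convert to float in one vectorized step
sample["U.S. viewers (in million)"] = (
    sample["U.S. viewers"].str.replace(" million", "", regex=False).astype(float)
)
print(sample)
```

Keeping `parsed` as a real datetime column would also let plotly treat the x-axis as a time axis.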

# 👏 <span style="color:red">The Most Popular Episode</span>

![Super](https://storage.googleapis.com/kagglesdsdata/datasets/1371292/2276776/processed.png?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20210527%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210527T144157Z&X-Goog-Expires=345599&X-Goog-SignedHeaders=host&X-Goog-Signature=87233d54d0835d7038bf0f0df3d300e31dc6fdb493a6e4a8a98cb388d2e17eeaf1d687253653955f98efc4f76acc80ff4dc96bc4fc2ac357bfe2ffa8369ad93e06217256cab5db9af6e605239ff236aaa656f65715218d79451087b67ca836472a8c8b0b13094aeb7722ebadc4ad03a6e1245b119674e69b246fddbc2c56289fbb5056f8451d6174aeeb1ed0f8890042ee81d9109946ebd56fce5fc4ee21b423b95b4fd0745a62f76cd70dbe743bd0771ed6b5629e1b5c227a08db73344cf37ab5052e9d951183766f2252b301a8b45900ba9243798996911a522b649208f7a228f72c690da76ca383bd4c32347641586298b660786cb5abd2d76d64d1d5690b)
[Note on image] I took screenshots from episodes and stitched these images together using [this open-source automatic webapp](https://jsvirk47.pythonanywhere.com/). 🤩 Looks supercool!

### **The One After the Super Bowl**: Ross goes to visit Marcel whilst on a trip to California and discovers he is working in commercials. Joey receives a fan letter from an attractive but unstable woman. Phoebe is asked to sing for children at a library.

### Season - 2
### Episode - 12/13
### Viewers - 52.9 Million

In [None]:
df[df['U.S. viewers (in million)'] == df['U.S. viewers (in million)'].max()].T

# 👋 <span style="color:orange">The Last One</span>
### Final Episode. This is the **second most popular episode** with 52.46 Million views.
![Last One](https://storage.googleapis.com/kagglesdsdata/datasets/1371292/2276836/processed%20%281%29.png?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20210527%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210527T145420Z&X-Goog-Expires=345599&X-Goog-SignedHeaders=host&X-Goog-Signature=336a04181d3eb8d474f02ad18a8c4cd952329326b5ccd3b940e4f66e1bdb2495cd035c9d2c19617fe99b68db419c4f5da802e6f2f4e614a046a457fdec04972eb1354ae0a8dc4e7118536def847b9b7255c953be3d075f8263c30ed5404ba8a2dd20faa69975296379527bea9931e49af2d982c92b76f8f1abcc9082f67901b684c908ab8d8db6ca98f65c0602e72407a401296ad589ea88f15d5723c5088732796665b41b0f25b5e5a1427609bf7001dbf8d86864e4a43f020a6d73b239b3302faf331c7a3ec59da6700327c46f12a7dacad64692875088fbc262ec139596871d049c30509bf320c76e509287969943738f53cd88ef2edc45f0b68da864b2aa)
[Note on image] I took screenshots from episodes and stitched these images together using [this open-source automatic webapp](https://jsvirk47.pythonanywhere.com/). 🤩 Looks supercool!

In [None]:
df[df['U.S. viewers (in million)'] == 52.46].T
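
Filtering on a hard-coded float works here, but the runner-up can also be looked up by rank. A minimal sketch on hypothetical numbers (only the column name matches the notebook; `toy` and its titles are placeholders):

```python
import pandas as pd

# Hypothetical viewer counts, not the real dataset
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "U.S. viewers (in million)": [20.1, 52.9, 52.46, 30.0],
})

# Rows with the two largest viewer counts, highest first
top2 = toy.nlargest(2, "U.S. viewers (in million)")
second = top2.iloc[1]  # second most viewed row
print(second["Title"])
```

This avoids depending on the exact value `52.46` appearing in the column.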

# 📈 Popularity Trends

#### `Views`, `Rating`, and `Share` appear highly correlated: the three trend lines below have nearly the same shape 😲
#### The most popular and second most popular episodes stand out as the two highest peaks.

In [None]:
px.line(df, x='Date', y='U.S. viewers (in million)', title='U.S. Viewers trend over years',
        hover_data=['Title', 'Season-Episode'])

In [None]:
df["Rating"] = df["Rating/Share"].apply(lambda x: float(x.split("/")[0]))
px.line(df, y="Rating", x="Date", title='Rating trend over years', hover_data=['Title', 'Season-Episode'])

In [None]:
df["Share"] = df["Rating/Share"].apply(lambda x: float(x.split("/")[1]))
px.line(df, y="Share", x="Date", title='Share trend over years', hover_data=['Title', 'Season-Episode'])
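
The visual similarity of the three curves can be checked numerically with a correlation matrix. The sketch below uses a few made-up rows (the values are illustrative assumptions, not dataset results) and splits `Rating/Share` the same way as the cells above.

```python
import pandas as pd

# Hypothetical rows in the same format as the notebook's columns
toy = pd.DataFrame({
    "U.S. viewers (in million)": [20.0, 25.0, 30.0, 28.0],
    "Rating/Share": ["13.0/21", "15.5/25", "18.0/29", "17.2/27"],
})

# Split "rating/share" into two float columns
parts = toy["Rating/Share"].str.split("/", expand=True).astype(float)
toy["Rating"] = parts[0]
toy["Share"] = parts[1]

# Pearson correlation between the three popularity measures
corr = toy[["U.S. viewers (in million)", "Rating", "Share"]].corr()
print(corr.round(2))
```

Values close to 1 off the diagonal would back up the "highly correlated" reading of the plots.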

# 📆 Number of Episodes each Month

In [None]:
episodes_count = pd.Series(sorted(df.Month))
episodes_count = episodes_count.apply(lambda x: calendar.month_name[x])
px.histogram(
    x=episodes_count,
    labels={'x':"Months", 'y': "Number of Episodes"},
    title="Number of episodes per month",
)
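
If the exact counts behind the histogram are needed, they can be tabulated directly. A small sketch, assuming a series of month numbers like the `Month` column (the numbers here are made up):

```python
import calendar
import pandas as pd

# Hypothetical month numbers standing in for df.Month
months = pd.Series([9, 9, 10, 1, 5, 10, 9])

# Count per month number, keep calendar order, then label with month names
counts = months.value_counts().sort_index()
counts.index = [calendar.month_name[m] for m in counts.index]
print(counts)
```

Sorting by the numeric index before renaming keeps the months in calendar order rather than alphabetical order.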

# ⏱ Episode Duration
Some episodes run longer than others. In the `Episode` column, I relabeled two-part episodes by joining their numbers (for example, `"24/25"` becomes `2425`), purely so each one can be plotted as a single value. Hover over the bubbles in the scatter plot below to see the original episode label in the `Season-Episode` field; the episode title is included in the hover fields as well.

So the longer episodes are the two-part ones. I think the data records both parts as a single episode.

In [None]:
epdf = df.dropna().reset_index(drop=True)  # drop rows with missing values once
px.scatter(epdf, x='Episode', y='Duration', size="Duration", color='Duration',
           color_continuous_scale=px.colors.sequential.RdBu, hover_data=['Title', 'Season-Episode'])

# ✍🏽 Who is a good writer?

In [None]:
def writer_process(x):
    """Split one 'Written by' cell into a list of writer names."""
    if x == 'nan':  # missing value after astype(str)
        return ['NONE']
    if x.startswith('Story'):
        # Keep only the names after "Story by :" on the first line
        x = x.split(":")[1].split("\n")[0]
    # str.split always returns a list, so no isinstance check is needed
    return [v.strip() for v in x.split("&")]

writers_list = df["Written by"].astype(str).apply(writer_process)
# Flatten the per-episode lists into one list of names
writers_list = [w for writers in writers_list for w in writers]

In [None]:
unclean_list = ['Michael BorkowStory by : Jill Condon',
'Amy Toomin\nTeleplay by : Shana Goldberg-Meehan',
'Andrew ReichGregory S. Malins',
'Scott SilveriAndrew Reich',
'Gregory S. MalinsMarta Kauffman',
'Scott SilveriMarta Kauffman',
'Scott SilveriMarta Kauffman',
'Mike SikowitzMichael Borkow'
]
for unclean in unclean_list:
    writers_list.remove(unclean)

# Now add the cleaned names
writers_list += [
    'Michael Borkow', 'Jill Condon',
    'Amy Toomin', 'Shana Goldberg-Meehan', 
    'Andrew Reich', 'Gregory S. Malins',
    'Gregory S. Malins', 'Marta Kauffman',
    'Scott Silveri', 'Marta Kauffman',
    'Scott Silveri', 'Marta Kauffman',
    'Mike Sikowitz', 'Michael Borkow'
]
num_episodes_per_writer = Counter(writers_list)
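
A hand-maintained remove-then-add list is easy to get out of sync, and the counts depend on every removed entry being added back. One way to keep the two steps paired is a mapping from each glued string to the names it contains. The sketch below is an assumption-laden alternative, not the notebook's method: `fixes` and `clean_writers` are hypothetical names, the split points are read off the strings in `unclean_list`, and `raw` is a made-up input.

```python
from collections import Counter

# Glued string -> the individual names it contains
fixes = {
    'Michael BorkowStory by : Jill Condon': ['Michael Borkow', 'Jill Condon'],
    'Amy Toomin\nTeleplay by : Shana Goldberg-Meehan': ['Amy Toomin', 'Shana Goldberg-Meehan'],
    'Andrew ReichGregory S. Malins': ['Andrew Reich', 'Gregory S. Malins'],
    'Scott SilveriAndrew Reich': ['Scott Silveri', 'Andrew Reich'],
    'Gregory S. MalinsMarta Kauffman': ['Gregory S. Malins', 'Marta Kauffman'],
    'Scott SilveriMarta Kauffman': ['Scott Silveri', 'Marta Kauffman'],
    'Mike SikowitzMichael Borkow': ['Mike Sikowitz', 'Michael Borkow'],
}

def clean_writers(names, fixes):
    """Replace every glued entry with its separate names; keep the rest as-is."""
    cleaned = []
    for name in names:
        cleaned.extend(fixes.get(name, [name]))
    return cleaned

# Hypothetical input list; repeated glued entries are handled automatically
raw = ['Ted Cohen', 'Scott SilveriAndrew Reich',
       'Scott SilveriMarta Kauffman', 'Scott SilveriMarta Kauffman']
cleaned = clean_writers(raw, fixes)
print(Counter(cleaned))
```

Because each glued string is replaced wherever it occurs, no entry can be removed without its names being re-added.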

In [None]:
px.histogram(y=writers_list, labels={'x': "Number of Episodes", 'y': 'Writer'}, height=800, title="Who wrote how many?")

In [None]:
# Get number of views for each writer
unique_writers_names = list(set(writers_list))
views_per_writer = {name:[] for name in unique_writers_names}
for i, row in enumerate(df["Written by"].astype(str).values):
    for name in unique_writers_names:
        if name in row:
            views_per_writer[name].append(df["U.S. viewers (in million)"][i])

# Sum the number of views
views_per_writer = {name:round(sum(views), 3) for name, views in views_per_writer.items()}

In [None]:
px.bar(x=views_per_writer.values(), y=views_per_writer.keys(),
       height=800,
       title="Total views (in million) each writer got over the entire series",
       labels={'x': 'Total Views (in million)', 'y': "Writer's name"},
      )

According to the data, the writer with the highest total number of views is:

In [None]:
max(views_per_writer, key=views_per_writer.get)

But looking at the bar plot, we can see that one other writer has the same number of views: `Andrew Reich`.

In [None]:
print(f"Ted Cohen's total number of views = {views_per_writer['Ted Cohen']} million")
print(f"Andrew Reich's total number of views = {views_per_writer['Andrew Reich']} million")

#### So, how many episodes did each of the two writers write?

In [None]:
print(f"Ted Cohen's total episodes = {num_episodes_per_writer['Ted Cohen']}")
print(f"Andrew Reich's total episodes = {num_episodes_per_writer['Andrew Reich']}")

Well... `Andrew Reich` wrote 20 episodes and has the same total number of views as `Ted Cohen`, who wrote 21. Therefore...

### The answer to the question `Who is a good writer?` is **Andrew Reich**
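
The comparison above weighs total views against the number of episodes written, which amounts to an average views-per-episode figure. A minimal sketch with placeholder writers and made-up numbers (`views_per_episode` is a hypothetical helper, and none of the values are results from the dataset):

```python
def views_per_episode(total_views, episode_counts):
    """Average views per episode for each writer with at least one episode."""
    return {
        name: total_views[name] / episode_counts[name]
        for name in total_views
        if episode_counts.get(name)
    }

# Hypothetical totals (in million) and episode counts
total = {"Writer A": 500.0, "Writer B": 500.0}
episodes = {"Writer A": 20, "Writer B": 25}

avg = views_per_episode(total, episodes)
best = max(avg, key=avg.get)  # writer with the highest average
print(best, avg[best])
```

With equal totals, the writer with fewer episodes comes out ahead, which is the same reasoning used for the answer above.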

# That's it ✌🏼

## Liked it? Upvote it 😁