<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Motivation" data-toc-modified-id="Motivation-1">Motivation</a></span></li><li><span><a href="#Dataset" data-toc-modified-id="Dataset-2">Dataset</a></span><ul class="toc-item"><li><span><a href="#Dataset-statistics" data-toc-modified-id="Dataset-statistics-2.1">Dataset statistics</a></span></li></ul></li><li><span><a href="#Theory" data-toc-modified-id="Theory-3">Theory</a></span><ul class="toc-item"><li><span><a href="#Analysis-of-the-main-characters" data-toc-modified-id="Analysis-of-the-main-characters-3.1">Analysis of the main characters</a></span></li><li><span><a href="#Analysis-of-all-the-characters" data-toc-modified-id="Analysis-of-all-the-characters-3.2">Analysis of all the characters</a></span></li><li><span><a href="#Text-analysis" data-toc-modified-id="Text-analysis-3.3">Text analysis</a></span></li></ul></li><li><span><a href="#Discussion" data-toc-modified-id="Discussion-4">Discussion</a></span></li><li><span><a href="#Contributions" data-toc-modified-id="Contributions-5">Contributions</a></span></li></ul></div>

In [1]:
import numpy as np
import pandas as pd

## Motivation

The dataset is a [transcript](https://fangj.github.io/friends/?fbclid=IwAR2vf_96q737D1Q-z45fe8obSGk_iWVhCtdKAaz6hP6PA8B0W4udjwcUxAI) of all the episodes of the American TV sitcom Friends. Friends is a well-known show that first aired in 1994. The show is about six friends in their 20s who live in New York. The group has a very dynamic relationship, which is constantly changing. Besides the main characters, there are a number of secondary characters in the show. The data is downloaded as .csv [files](https://github.com/shilpibhattacharyya/Friends_Analysis/tree/master/transcripts_friends).

Because the show is so well known and we know a lot about it, the results are easy to check against what happens in the story. By using network analysis we can investigate our existing theories about the show.


## Dataset

The .csv files with the transcripts of each episode are first combined into one dataframe using pandas. The dataframe contains the following information for each spoken line: the speaker, the line, the episode, and the scene. The .csv files also contain irrelevant information, which we do not include in the final dataframe.
The dataframe with all spoken lines, speakers, episodes and scenes is shown here:


In [8]:
url = "https://raw.githubusercontent.com/LunaHub/Friends_social_data_analysis_2019/master/data/All_Friends_data.csv"
df = pd.read_csv(url).drop("Unnamed: 0",axis=1)
df.head()

Unnamed: 0,Speaker,Text,Episode,Scene
0,monica,"oh, the way you crushed mike at ping pong was...",1001,"[scene barbados, monica and chandler's room. t..."
1,chandler,"you know, i'd love to, but i'm a little tired.",1001,"[scene barbados, monica and chandler's room. t..."
2,monica,i'll put a pillowcase over my head.,1001,"[scene barbados, monica and chandler's room. t..."
3,chandler,you're on!,1001,"[scene barbados, monica and chandler's room. t..."
4,phoebe,hey!,1001,"[scene barbados, monica and chandler's room. t..."


Furthermore, a dataframe is generated that also records which characters speak or are spoken about in each scene. For this, the spoken lines are searched for possible names, using the nltk corpus of names together with the unique speakers in the dataframe. The dataframe is shown here:

In [3]:
url = "https://raw.githubusercontent.com/LunaHub/Friends_social_data_analysis_2019/master/data/Dataset_with_all_scene_characters.csv"
df_all_scene = pd.read_csv(url).drop("Unnamed: 0",axis=1)
df_all_scene.head()

Unnamed: 0,Speaker,Text,Episode,Scene,Scene_characters
0,monica,"oh, the way you crushed mike at ping pong was...",1001,"[scene barbados, monica and chandler's room. t...",['chandler' 'charlie' 'joey' 'mike' 'monica' '...
1,chandler,"you know, i'd love to, but i'm a little tired.",1001,"[scene barbados, monica and chandler's room. t...",['chandler' 'charlie' 'joey' 'mike' 'monica' '...
2,monica,i'll put a pillowcase over my head.,1001,"[scene barbados, monica and chandler's room. t...",['chandler' 'charlie' 'joey' 'mike' 'monica' '...
3,chandler,you're on!,1001,"[scene barbados, monica and chandler's room. t...",['chandler' 'charlie' 'joey' 'mike' 'monica' '...
4,phoebe,hey!,1001,"[scene barbados, monica and chandler's room. t...",['chandler' 'charlie' 'joey' 'mike' 'monica' '...
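
The name search described above can be sketched roughly as follows. This is a minimal illustration and not the script used in the project: the helper `scene_characters`, the tuple layout of `rows` and the small `candidate_names` set are assumptions made for the example. A real run would build the candidate set from the nltk names corpus and the unique speakers.

```python
import re

def scene_characters(rows, candidate_names):
    """Map each scene to the sorted names that speak or are mentioned in it.

    rows: iterable of (speaker, text, scene) tuples (hypothetical layout).
    candidate_names: set of lower-case names to look for in the text.
    """
    found = {}
    for speaker, text, scene in rows:
        names = found.setdefault(scene, set())
        names.add(speaker)  # the speaker is always part of the scene
        # Compare whole words only, so a short name never matches inside a longer word.
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in candidate_names:
                names.add(word)
    return {scene: sorted(names) for scene, names in found.items()}

# Toy input, not taken from the real dataset.
rows = [
    ("monica", "oh, the way you crushed mike at ping pong was great", "scene 1"),
    ("chandler", "you're on!", "scene 1"),
    ("ross", "hi joey", "scene 2"),
]
result = scene_characters(rows, {"mike", "joey", "monica", "chandler", "ross"})
```

Note that this word-by-word matching does not handle names consisting of several words, such as "alexandra steele".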


Lastly, the list of all possible character names is saved, as it is used in some of the network analyses.

In [4]:
url = "https://raw.githubusercontent.com/LunaHub/Friends_social_data_analysis_2019/master/data/Dataset_all_potential_characters.csv"
df_names = pd.read_csv(url).drop("Unnamed: 0",axis=1)
df_names.head()

Unnamed: 0,0
0,adrienne
1,alan
2,alex
3,alexandra steele
4,alice


### Dataset statistics

In [28]:
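# A list of all unique episodes, separated by seasons.
# In the sorted list the season 10 IDs (1001, ...) are interleaved with the
# season 1 IDs (101, ...), hence the split slices for the first and last season.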
l=list(np.unique(df.Episode))
seasons = []
seasons.append(l[9:10]+l[18:41])
seasons.append(l[41:64])
seasons.append(l[64:89])
seasons.append(l[89:112])
seasons.append(l[112:135])
seasons.append(l[135:158])
seasons.append(l[158:181])
seasons.append(l[181:204])
seasons.append(l[204:227])
seasons.append(l[0:9]+l[10:18])

AllEpisodes = []
for season in seasons:
    for Episode in season:
        AllEpisodes.append(Episode)

speakers=list(np.unique(df.Speaker))
scenes=list(np.unique(df.Scene.astype(str)))

print('Number of lines:    ' + str(len(df)))
print('Number of speakers: ' + str(len(speakers)))
print('Number of scenes:   ' + str(len(scenes)))
print('Number of episodes: ' + str(len(AllEpisodes)))
print('Number of seasons:  ' + str(len(seasons)))

Number of lines:    61264
Number of speakers: 907
Number of scenes:   2878
Number of episodes: 227
Number of seasons:  10


The file is 10.1 MB and contains 61264 lines, with 907 unique speakers and 2878 unique scenes, spread over 227 episodes in 10 seasons.

## Theory

By using network theory and text analysis we can investigate our theories, which are built on our knowledge of the show. The analysis is divided into three parts: a network of the main characters, a network of the secondary characters, and a text analysis.

### Analysis of the main characters

The show has six main characters: Ross, Rachel, Phoebe, Monica, Joey and Chandler. The characters' relationships are constantly evolving through the show. By using network analysis we can investigate the relationships between them. This can be used to show who has the strongest bond, and how the relationship between specific characters evolves. The result can then be compared to our knowledge about the show.

The analysis is based on the dataset containing the speaker, line, and scene.

The network is created with the Python library networkX. If two of the main characters appear in the same scene, an edge between them is created; if the edge already exists, its weight is increased. This is done for each season and for all seasons combined. Because the number of scenes varies through the show, the weight is calculated relative to the number of scenes. The network is undirected, which means that each edge is added twice when the network is created. Dividing by 2 gives the actual weight.

\begin{align}
W_{rel}=\frac{0.5 \cdot W}{N_{scenes}}
\end{align}

Here $W$ is the raw weight of the edge and $N_{scenes}$ is the number of scenes in the season (or in the whole show). The relative weight $W_{rel}$ tells us the share of those scenes in which the two characters appear together.
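
The weighting can be sketched as follows. This is a minimal, dependency-free illustration rather than the project script: it stores the edge weights in a plain `Counter` instead of a networkX graph, and the function name `relative_weights` and the toy scenes are assumptions. Ordered pairs are used on purpose so that every edge is added twice, as described above.

```python
from collections import Counter
from itertools import permutations

MAIN = {"ross", "rachel", "phoebe", "monica", "joey", "chandler"}

def relative_weights(scenes):
    """Return {pair: share of scenes in which the two main characters co-occur}.

    scenes: list of sets of character names, one set per scene (hypothetical input).
    """
    raw = Counter()
    for characters in scenes:
        present = sorted(MAIN & set(characters))
        # Ordered pairs visit every undirected edge twice: (a, b) and (b, a).
        for a, b in permutations(present, 2):
            raw[frozenset((a, b))] += 1
    n_scenes = len(scenes)
    # The factor 0.5 removes the double counting before normalising.
    return {pair: 0.5 * count / n_scenes for pair, count in raw.items()}

# Toy scenes, not taken from the real dataset.
scenes = [
    {"ross", "rachel", "gunther"},
    {"ross", "rachel", "joey"},
    {"monica", "chandler"},
    {"phoebe"},
]
weights = relative_weights(scenes)
```

In this toy example Ross and Rachel share two of the four scenes, so their relative weight is 0.5.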

When observing how the weights between the characters evolve through the seasons, the different weights turn out to follow each other closely. Another approach is therefore to compare each weight to the sum of all the weights in that season. This can give a clearer indication of how each relationship evolves relative to the rest of the group.
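
One way to write this alternative down, as a sketch: let $W_{ij}^{(s)}$ be the weight of the edge between characters $i$ and $j$ in season $s$. These symbols are introduced here for illustration and are not taken from the analysis script. The share of a pair in the total weight of the season is then

\begin{align}
S_{ij}^{(s)}=\frac{W_{ij}^{(s)}}{\sum_{k<l}W_{kl}^{(s)}}
\end{align}

where the sum runs over all 15 pairs of main characters, so the shares within one season add up to 1.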

Some of the characters develop a romantic relationship during the show. This can be investigated further by counting the number of scenes in which the two characters are alone together.
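
Counting the scenes that two characters have to themselves could look roughly like this. The sketch assumes that each scene is available as a set containing only the characters who are actually present. The function name `scenes_alone_together` and the toy scenes are made up for the example.

```python
def scenes_alone_together(scenes, pair):
    """Count scenes whose cast is exactly the two characters in `pair`."""
    target = set(pair)
    return sum(1 for characters in scenes if set(characters) == target)

# Toy scenes, not taken from the real dataset.
scenes = [
    {"ross", "rachel"},
    {"ross", "rachel", "joey"},
    {"monica", "chandler"},
    {"rachel", "ross"},
]
alone = scenes_alone_together(scenes, ("ross", "rachel"))
```

With the `Scene_characters` column of the dataset this count would be a lower bound, because that column also lists characters who are only mentioned in the scene.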

The analysis is done using this [script](https://github.com/LunaHub/Friends_social_data_analysis_2019/blob/master/Jupyter_notebook/main_characters_network_analyse.ipynb).

The [result](https://lunahub.github.io/Friends_social_data_analysis_2019/network_analysis_main_char) is a complete, undirected, weighted graph. When the result is compared with our knowledge of the show, the analysis gives a clear picture of the network that is consistent with the storyline.

### Analysis of all the characters

This analysis investigates the social network of all the characters in the show. The goal is to find the most important secondary characters in terms of network analysis and to explore how the most important secondary characters change over the seasons. Measures like degree, centrality and modularity are used.

This analysis uses two dataframes. The first dataframe includes all spoken lines of all episodes, together with the speaker, the episode, the scene and the characters in the scene. The second contains a list of all relevant speakers.

networkX in Python is the primary tool used for the analysis. A network is generated for each season and one for all seasons combined. The networks are generated by iterating over the spoken lines (the rows of the dataframe) and creating a node for each speaker. The speaker is then connected by an edge to all the other characters in that scene. The edge weight is increased by one every time a connection between two nodes already exists. When plotting the networks, the thickness of an edge is based on its weight and the size of a node is based on its degree, that is, the number of other nodes it is connected to.

When generating the network, we do not connect the main characters to each other, because this part of the analysis focuses on the connections to the secondary characters. The main characters also have very strong connections to each other, which would dominate the visualization if they were included.
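
The construction described above can be sketched with networkX as follows. This is an illustration under stated assumptions, not the project script: the function name `build_network`, the `(speaker, scene_characters)` row layout and the toy rows are made up for the example.

```python
import networkx as nx

MAIN = {"ross", "rachel", "phoebe", "monica", "joey", "chandler"}

def build_network(rows):
    """Build a weighted, undirected graph from (speaker, scene_characters) rows."""
    G = nx.Graph()
    for speaker, scene_characters in rows:
        G.add_node(speaker)
        for other in scene_characters:
            if other == speaker:
                continue  # no self-loops
            if speaker in MAIN and other in MAIN:
                continue  # main characters are not connected to each other
            if G.has_edge(speaker, other):
                G[speaker][other]["weight"] += 1  # repeated connection
            else:
                G.add_edge(speaker, other, weight=1)
    return G

# Toy rows: four spoken lines in one scene, not taken from the real dataset.
cast = ["monica", "chandler", "mike"]
rows = [("monica", cast), ("chandler", cast), ("mike", cast), ("monica", cast)]
G = build_network(rows)
degrees = dict(G.degree())  # number of distinct neighbours per node
```

For plotting, the degrees and edge weights can be passed to `nx.draw_networkx` through its `node_size` and `width` arguments.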

Some general features of the network are explored including degree distribution as seen in the figure below: 
<img src = "https://raw.githubusercontent.com/LunaHub/Friends_social_data_analysis_2019/master/figures/Degree_dist.png" width="500">
This distribution resembles a right-skewed Poisson distribution. A Poisson degree distribution is characteristic of a random network. However, the skew makes the distribution approximate a power law, which is characteristic of real scale-free networks.
The degree is also used to find the most important secondary characters of each season and in all seasons. This is illustrated in the figure below. The result fits well with the storyline of the show.
<img src = "https://raw.githubusercontent.com/LunaHub/Friends_social_data_analysis_2019/master/figures/bi_characters_degree_big_network.png">
The most important characters can also be found from the number of seasons a character appears in. This analysis, however, found characters like “bob”, who, knowing the story, is not an important character in the show. It is, however, a name that is mentioned a lot in the show.

It was also investigated whether any communities exist in the network of all the seasons. For this, the Louvain algorithm is applied. This algorithm defines the communities by attempting to optimize the modularity of a partition of the network. The modularity of the resulting communities is, however, only 0.09, which means that the communities are weakly separated from each other.
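
As a minimal sketch of this step, the snippet below runs Louvain on a toy graph and computes the modularity of the partition it finds. It assumes a networkX version that ships `louvain_communities`. The original analysis may have used a different Louvain implementation, and the toy graph is made up for the example.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Toy graph: two groups of four fully connected nodes, joined by a single edge.
G = nx.Graph()
G.add_edges_from(nx.complete_graph(["a1", "a2", "a3", "a4"]).edges())
G.add_edges_from(nx.complete_graph(["b1", "b2", "b3", "b4"]).edges())
G.add_edge("a1", "b1")

# A fixed seed makes the randomised node ordering reproducible.
communities = louvain_communities(G, weight="weight", seed=0)
score = modularity(G, communities, weight="weight")
```

The two groups in the toy graph are clearly separated, so the modularity is far higher than the 0.09 found for the Friends network.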

Furthermore, the friendship paradox was investigated. The friendship paradox states that, on average, your friends have more friends than you do. This was true for most of the characters in the network of all seasons. However, it was not true for the six main characters or for the most important secondary characters.
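
A check of this kind can be sketched as follows. The sketch compares each node's degree with the mean degree of its neighbours. It is dependency-free, and the function name `friendship_paradox`, the adjacency-dict input and the toy network are assumptions made for the example.

```python
def friendship_paradox(adjacency):
    """For each node, report whether its neighbours have a higher mean degree than it has.

    adjacency: dict mapping each node to the set of its neighbours (hypothetical input).
    """
    degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
    result = {}
    for node, neighbours in adjacency.items():
        if not neighbours:
            continue  # isolated nodes have no friends to compare with
        mean_friend_degree = sum(degree[n] for n in neighbours) / len(neighbours)
        result[node] = mean_friend_degree > degree[node]
    return result

# Star-shaped toy network: one hub connected to three otherwise unconnected nodes.
adjacency = {
    "monica": {"mike", "gunther", "janice"},
    "mike": {"monica"},
    "gunther": {"monica"},
    "janice": {"monica"},
}
paradox = friendship_paradox(adjacency)
```

In the toy network the paradox holds for the three peripheral nodes but not for the hub, which matches the pattern described above for the main characters.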

A full illustration of the result can be seen [here](https://lunahub.github.io/Friends_social_data_analysis_2019/network_analysis) and the script for the analysis can be found [here](https://github.com/LunaHub/Friends_social_data_analysis_2019/blob/master/Jupyter_notebook/Network_analysis.ipynb).

### Text analysis

## Discussion

## Contributions