Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

The first step is forking the repository in which this notebook lives. After that, there are two parts to be completed in this notebook:

- **Project information**:  The title of the project, a project description, etc.

- **Project introduction**: The first three text and code cells that will form the introduction of your project.

When complete, please email the link to your forked repo to projects@datacamp.com with the email subject line _DataCamp project audition_. If you have any questions, please reach out to projects@datacamp.com.

# Project information

**Project title**: The Marvel hero graph

**Name:** Pierre Gutierrez

**Email address associated with your DataCamp account:** pierre.j.p.gutierrez@gmail.com

**Project description**: 

Networks are an essential tool in the data scientist's toolbox. Their best-known use is the analysis of social network interactions (Facebook, LinkedIn, Twitter, etc.), but networks are also widely used in other domains such as telecommunications, logistics, fraud detection (networks of fraudsters), biostatistics (genetics, genomics), and recommendation engines. Networks can also be a great way to visualize interactions in 2D. Here, we propose to study how to create and display a network using a Marvel hero interaction dataset.

In this project, we assume the student has basic knowledge of Python. We will start by loading and manipulating the dataset using pandas. Then, we will transition to network analysis: the student will create a network using the networkx package and learn how to display it in a Jupyter notebook. Finally, we will dive into common network measures such as degree, centrality, and betweenness, and derive insights about the Marvel universe.

The dataset the student will use is the Marvel social network dataset. It is freely available online (for example on Kaggle as [hero-network.csv](https://www.kaggle.com/csanhueza/the-marvel-universe-social-network#hero-network.csv)). It contains two columns: "id1" and "id2". Each line represents two heroes appearing in the same comic. This is a great dataset for learning network analysis. First, it is big enough not to be a toy dataset. Second, it contains real insights about how the Marvel universe is structured. Finally, it is quite similar to a real social network, with internal clusters and a long-tailed distribution. To get a feel for what we will cover, have a look at the [original blog post](https://blog.dataiku.com/2015/05/19/marvel-social-graph-analysis).

# Project introduction

***Note: nothing needs to be filled out in this cell. It is simply setting up the template cells below.***

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://instructor-support.datacamp.com/projects/datacamp-projects-jupyter-notebook). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. Loading the data


Marvel is one of the leading comics publishers, along with DC. The company was founded around 80 years ago and has influenced generations of young readers ever since. Today, hardly a quarter goes by without a new Marvel blockbuster!

![alt text](./img/marvel_general.jpg)


The goal of this project is to analyse the Marvel universe: who are the most popular heroes? Are there distinct sub-universes within Marvel? How are they connected to each other? Which heroes appear in the most crossovers?

To do so, we are going to use network analysis techniques. Our final goal is to display a network like the one below, where we can clearly see the star heroes, their connections, and the clusters they belong to.


![alt text](./img/marvel_graph_1.jpg)


OK, we are ready to go! Let's start by importing the CSV data into a pandas DataFrame.

In [13]:
# pandas is a great Python library for data processing!
import pandas as pd

# Load the hero co-appearance pairs and preview the first rows.
marvel_data = pd.read_csv("datasets/hero_network.csv")
marvel_data.head()

|   | id1 | id2 |
|---|---|---|
| 0 | LITTLE, ABNER | PRINCESS ZANDA |
| 1 | LITTLE, ABNER | BLACK PANTHER/T'CHAL |
| 2 | BLACK PANTHER/T'CHAL | PRINCESS ZANDA |
| 3 | LITTLE, ABNER | PRINCESS ZANDA |
| 4 | LITTLE, ABNER | BLACK PANTHER/T'CHAL |


## 2. Who's your favourite hero?

The dataset is composed of two columns, "id1" and "id2".   
Each line corresponds to two heroes appearing in the same comic in the Marvel literature.   
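
Because a hero can show up in either column, counting a single column would undercount some heroes and miss others entirely. Here is a minimal sketch of the idea on a hypothetical toy table (the hero names `A` to `D` are made up for illustration and are not taken from the real dataset):

```python
import pandas as pd

# Hypothetical toy pairs, only here to illustrate the counting logic.
toy = pd.DataFrame({
    "id1": ["A", "A", "A", "B"],
    "id2": ["B", "C", "D", "C"],
})

# Counting "id1" alone never sees heroes C and D.
id1_only = toy["id1"].value_counts()
print(id1_only.to_dict())

# Stacking both columns counts every appearance of every hero.
toy_counts = pd.concat([toy["id1"], toy["id2"]]).value_counts()
print(toy_counts.to_dict())
```

In the stacked version, `A` is counted three times, `B` and `C` twice each, and `D` once.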

Now, let's take a quick look at the most represented heroes.

In [14]:
# A hero can appear in either column, so stack both before counting.
id1 = marvel_data["id1"]
id2 = marvel_data["id2"]
id_list = pd.concat([id1, id2])
counts = id_list.value_counts()
counts.head(10)

# Note: tail() shows the least represented heroes instead.
# counts.tail(10)

CAPTAIN AMERICA         16499
SPIDER-MAN/PETER PAR    13717
IRON MAN/TONY STARK     11817
THOR/DR. DONALD BLAK    11427
THING/BENJAMIN J. GR    10681
WOLVERINE/LOGAN         10353
HUMAN TORCH/JOHNNY S    10237
SCARLET WITCH/WANDA      9911
MR. FANTASTIC/REED R     9775
VISION                   9696
dtype: int64

## 3. Creating the networkx graph 

Great! Most of these names ring a bell: Captain America, Spider-Man, Iron Man, Thor, and the Fantastic Four have all appeared in movies over the last ten years.

We are now going to view the data as a network of interactions. In graph theory, a network is defined by a set of [vertices](https://en.wikipedia.org/wiki/Vertex_(graph_theory)) and a set of [edges](https://en.wikipedia.org/wiki/Graph_theory). Here, each vertex will be a hero, and an edge will be added if two heroes appear in the same comic.   
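
To make the vocabulary concrete, here is a small sketch that builds a graph by hand with networkx. The hero names are placeholders chosen for illustration, not entries from the dataset:

```python
import networkx as nx

# An undirected graph: each vertex is a hero, each edge a shared comic.
toy_graph = nx.Graph()
toy_graph.add_edge("HERO A", "HERO B")  # vertices are created on the fly
toy_graph.add_edge("HERO B", "HERO C")

print(sorted(toy_graph.nodes()))
print(toy_graph.number_of_edges())

# Edges have no direction: (B, A) is the same edge as (A, B).
print(toy_graph.has_edge("HERO B", "HERO A"))
```

An undirected graph fits this data because "appearing in the same comic" is a symmetric relation.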

Let's import the data into a networkx graph and display the degrees of the most connected vertices.    

Note: there is a difference between a hero's degree (number of distinct connections) and their number of occurrences in the original dataset. This is because two heroes can appear together in several comics!
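
The difference is easy to see on a toy example. In this sketch (hypothetical heroes `A`, `B`, and `C`, assuming a networkx version that provides `from_pandas_edgelist`), `A` and `B` share two comics, but the default undirected graph keeps a single edge between them:

```python
import networkx as nx
import pandas as pd

# Hypothetical toy pairs: A and B appear together in two comics.
toy_pairs = pd.DataFrame({
    "id1": ["A", "A", "A"],
    "id2": ["B", "B", "C"],
})

# Occurrences: every row in which the hero appears is counted.
occurrences = pd.concat([toy_pairs["id1"], toy_pairs["id2"]]).value_counts()

# Degree: duplicate pairs collapse into one edge in a simple graph.
toy_net = nx.from_pandas_edgelist(toy_pairs, "id1", "id2")

print(occurrences["A"])     # 3 rows mention A
print(toy_net.degree("A"))  # 2 distinct neighbours: B and C
```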

In [26]:
import networkx as nx
# Make sure your networkx version is up to date:
# the older from_pandas_dataframe function behaves differently.

nx_graph = nx.from_pandas_edgelist(marvel_data, "id1", "id2")

# Collect the (hero, degree) pairs in a DataFrame, most connected first.
degrees = pd.DataFrame(list(nx_graph.degree()), columns=["hero", "degree"])
degrees = degrees.sort_values("degree", ascending=False)
degrees.head(10)

|   | hero | degree |
|---|---|---|
| 60 | CAPTAIN AMERICA | 1905 |
| 48 | SPIDER-MAN/PETER PAR | 1737 |
| 6 | IRON MAN/TONY STARK | 1521 |
| 79 | THING/BENJAMIN J. GR | 1416 |
| 74 | MR. FANTASTIC/REED R | 1377 |
| 65 | WOLVERINE/LOGAN | 1368 |
| 183 | HUMAN TORCH/JOHNNY S | 1361 |
| 57 | SCARLET WITCH/WANDA | 1322 |
| 100 | THOR/DR. DONALD BLAK | 1289 |
| 84 | BEAST/HENRY &HANK& P | 1265 |


*Stop here! Only the first three tasks. :)*