# 7.1 : Find Influencers on Twitter


### Overview
Find 'influencers' in a Twitter follower graph.

### Depends On
None

### Run time
20 mins

### Lab Setup

This lab uses the GraphFrames Spark package, which is currently NOT part of the default Spark distribution.

We use it for two reasons:

1. GraphFrames will likely be the basis of future graph processing in Spark.

2. RDD-based GraphX has no Python API, whereas GraphFrames does. <br>Because this Jupyter notebook is written in Python, GraphFrames is required.

To run the lab, we have two choices:

**option 1 : Jupyter**

**Note:** 
Jupyter Lab will already be running on port 8888, so kill that process first.
```bash
$ sudo netstat -plnt | grep 8888
# The process id is shown in the output; substitute it for <process-id> below.

$ sudo kill -9 <process-id>
```

```bash
$ PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ~/apps/spark/bin/pyspark --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
```

**option 2 : PySpark (command line)**
```bash
$ PYSPARK_PYTHON=python3  ~/apps/spark/bin/pyspark --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
```

Replace the GraphFrames version in the package coordinate with the latest release that matches your Spark and Scala versions.
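
The coordinate passed to `--packages` above encodes three versions: the GraphFrames release, the Spark major.minor line, and the Scala binary version. The sketch below only shows how that string is put together. The helper name `graphframes_coordinate` is hypothetical, and it does not check that a given combination is actually published, so confirm that before using it.

```python
def graphframes_coordinate(gf_version, spark_version, scala_version):
    """Format a --packages coordinate in the pattern used in this lab.

    Hypothetical helper for illustration only: it formats a string and
    does not verify that the combination exists in a package repository.
    """
    return "graphframes:graphframes:{}-spark{}-s_{}".format(
        gf_version, spark_version, scala_version
    )


# Rebuild the coordinate used in the commands above.
coordinate = graphframes_coordinate("0.7.0", "2.4", "2.11")
print(coordinate)
```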

With that said, let's start:

## Step 1: Build the following Twitter graph
Here is some real-world data:
<img src="../assets/images/7.1a.png" style="border: 5px solid grey; max-width:100%;"/>

We are using data from a real Twitter account; if you want, you can use your own.

#### Import the necessary libraries

In [None]:
# %AddDeps graphframes graphframes 0.7.0-spark2.4-s_2.11
# %lsmagic

**NOTE** Only execute the following if you are running in a notebook

In [None]:
# initialize Spark Session
## ONLY execute this if you are running within jupyter notebook 
import os
import sys
top_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
if top_dir not in sys.path:
    sys.path.append(top_dir)

from init_spark import init_spark
spark = init_spark()
sc = spark.sparkContext

#### Construct the array of vertices

Data structure: id, Twitter handle, number of followers, gender of the account holder

In [None]:
vertices = spark.createDataFrame([
        (1, "@markkerzner", 309, "M"),  # (id, Name, followers, gender)
        (2, "@mjbrender", 3101, "M"),
        (3, "@dridisahar1", 27, "F"),
        (4, "@dez_blanchfield", 38600, "M"),  # trailing spaces removed from handles
        (5, "@ch_doig", 519, "F"),
        (6, "@Sunitha_Packt", 332, "F"),
        (7, "@WibiData", 2477, "N")  # Name here is a company, so gender is neutral
], ["id", "Name", "followers", "gender"])

vertices.show()

#### Construct the array of edges

In this step, every edge links my account (vertex 1) to one of my followers; the third field is that follower's retweet count.

In [None]:
edges = spark.createDataFrame([
        (1, 2, 7), # (src, dst, retweets)
        (1, 3, 2),
        (1, 4, 4),
        (1, 5, 3),
        (1, 6, 1),
        (1, 7, 2)
], ["src", "dst", "retweets"])

edges.show()
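
GraphFrames relates the `src` and `dst` columns of each edge to the vertex `id` column, so an edge that points at a missing id is worth catching before the graph is built. A minimal plain-Python sketch of that check, using a copy of the lab data so it runs without Spark (the variable names are illustrative):

```python
# Plain-Python copy of the lab data (vertex ids only), for a check without Spark.
vertex_ids = {1, 2, 3, 4, 5, 6, 7}
edge_rows = [
    (1, 2, 7),  # (src, dst, retweets)
    (1, 3, 2),
    (1, 4, 4),
    (1, 5, 3),
    (1, 6, 1),
    (1, 7, 2),
]

# Collect every edge whose src or dst does not match a vertex id.
dangling = [
    (src, dst, retweets)
    for src, dst, retweets in edge_rows
    if src not in vertex_ids or dst not in vertex_ids
]
print("dangling edges:", dangling)
```

An empty list means every edge endpoint refers to a vertex defined above.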

#### Construct the graph from the vertices and edges

In [None]:
from graphframes import GraphFrame

graph = GraphFrame(vertices, edges)

## Step 2 : Analyzing Graph
#### Print graph

In [None]:
# Vertices
graph.vertices.show()

In [None]:
# Edges
graph.edges.show()

In [None]:
# triplets
graph.triplets.show(truncate=False)
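
A triplet pairs each edge with the full records of its source and destination vertices. The plain-Python sketch below reproduces that join on a copy of the lab data, as an aid to reading the `triplets` output; it is not how GraphFrames computes it internally.

```python
# Plain-Python copy of the lab data, so the sketch runs without Spark.
vertex_rows = {
    1: ("@markkerzner", 309, "M"),
    2: ("@mjbrender", 3101, "M"),
    3: ("@dridisahar1", 27, "F"),
    4: ("@dez_blanchfield", 38600, "M"),
    5: ("@ch_doig", 519, "F"),
    6: ("@Sunitha_Packt", 332, "F"),
    7: ("@WibiData", 2477, "N"),
}
edge_rows = [(1, 2, 7), (1, 3, 2), (1, 4, 4), (1, 5, 3), (1, 6, 1), (1, 7, 2)]

# Each triplet is (source vertex record, edge, destination vertex record).
triplets = [
    (vertex_rows[src], (src, dst, retweets), vertex_rows[dst])
    for src, dst, retweets in edge_rows
]

for src_vertex, edge, dst_vertex in triplets:
    print(src_vertex[0], "->", dst_vertex[0], "retweets:", edge[2])
```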

## Step 3 : Query the graph

#### Filter out male followers

In [None]:
graph.vertices.filter("gender != 'M'").show()

#### Find my significant followers

In [None]:
graph.vertices.filter("followers > 1000").show() 

#### Find the followers who retweet me often enough

In [None]:
graph.edges.filter("retweets > 5").show()

graph.edges.filter("retweets > 5").count()
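
The edge filter returns only ids, so it helps to resolve each `dst` back to a handle. A plain-Python sketch of the same `retweets > 5` filter on a copy of the lab data, useful for checking the Spark result by hand (the names `handles` and `frequent` are illustrative):

```python
# Plain-Python copy of the lab data, so the sketch runs without Spark.
handles = {
    1: "@markkerzner",
    2: "@mjbrender",
    3: "@dridisahar1",
    4: "@dez_blanchfield",
    5: "@ch_doig",
    6: "@Sunitha_Packt",
    7: "@WibiData",
}
edge_rows = [(1, 2, 7), (1, 3, 2), (1, 4, 4), (1, 5, 3), (1, 6, 1), (1, 7, 2)]

# Keep only edges with more than 5 retweets, mirroring the filter above.
frequent = [(src, dst, retweets) for src, dst, retweets in edge_rows if retweets > 5]

print("matching edges:", len(frequent))
for src, dst, retweets in frequent:
    print(handles[dst], "retweets:", retweets)
```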

#### Count my male and female followers

In [None]:
# Note: the vertex table also contains my own account (id 1), so it is included in the male count.
num_male = graph.vertices.filter("gender == 'M'").count()
num_female = graph.vertices.filter("gender == 'F'").count()
print('Males %d, Females %d' % (num_male, num_female))
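
Running one filter per gender scans the vertices once for each value. The counts can also be gathered in a single pass; in PySpark that would typically be `graph.vertices.groupBy("gender").count()`. The sketch below shows the same idea in plain Python on a copy of the lab data, so it runs without Spark.

```python
from collections import Counter

# Gender column of the seven lab vertices, in id order (id 1 is the account owner).
genders = ["M", "M", "F", "M", "F", "F", "N"]

# Count every gender value in one pass over the data.
gender_counts = Counter(genders)

print('Males %d, Females %d, Neutral %d' % (
    gender_counts["M"], gender_counts["F"], gender_counts["N"]))
```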