# Big Data

## Python version : 3
## Libraries used
* <b>Pandas</b> :- Used for storing and managing tabular data for analysis.
* <b>NumPy</b> :- Used for generating numeric arrays (here, the user ids).
 

## Datasets Used

* <b>users_table.csv</b> :- Contains data about the users of a social network.
  * <b>Surname</b> :- Surname of the user.
  * <b>Name</b> :-  Name of the user.
  * <b>Age</b> :- Current age of user.
  * <b>Subscription Date</b> :- Date when the user subscribed to the social network.
* <b>posts_table.csv</b>
  * <b>User</b> :- User who made the post.
  * <b>Post Type</b> :- Type of post, e.g. Image, Text or Gif.
  * <b>Post Date</b> :- Date recorded when the post was made.
* <b>reactions_table.csv</b>
  * <b>User</b> :- User who reacted to a post.
  * <b>Reaction Type</b> :- Type of reaction, e.g. Like, Emoticon or Comment.
  * <b>Reaction Date</b> :- Date recorded when they reacted.
* <b>friends_table.csv</b>
  * <b>Friend 1</b> :- User on one side of a friendship.
  * <b>Friend 2</b> :- User who is a friend of Friend 1.

## Importing necessary libraries

In [29]:
#import libraries
import pandas as pd
import numpy as np

## Reading data

In [33]:
#Reading data to pandas dataframe
user_table = pd.read_csv('user_table.csv')
posts_table = pd.read_csv('posts_table.csv')
reactions_table = pd.read_csv('reactions_table.csv')
friends_table = pd.read_csv('friends_table.csv')

# Task 1 :- 
## What is the most common name in the social network? How many people share it?


A column holding each user's full name makes name-related queries easier. Create a column named **FullName** by combining **Name** and **Surname** in the user table.


In [37]:
# Assigning Full Name to a variable.
fullname = user_table['Name']+" "+user_table['Surname']
# Creating and storing full name to user table.
if 'FullName' not in user_table.columns:
  user_table.insert(loc=0,column='FullName',value=fullname)

### The data after adding FullName

In [39]:
print(user_table.head(12))

             FullName      Surname      Name  Age  Subscription Date
0         Sarah Smith        Smith     Sarah   30         1588157373
1     Francine Picard       Picard  Francine   32         1588161732
2           Hans Roth         Roth      Hans   40         1588157337
3           Ali Pomme        Pomme       Ali   28         1588165636
4      Jordi Di Lillo     Di Lillo     Jordi   42         1588156042
5           Anna Roth         Roth      Anna   26         1588162689
6          Jordi Kirk         Kirk     Jordi   56         1588153009
7   Josie Beierlorzer  Beierlorzer     Josie   20         1588166376
8       Robert Picard       Picard    Robert   39         1588158173
9      Jean-Luc Meier        Meier  Jean-Luc   37         1588156009
10         Josie Kirk         Kirk     Josie   31         1588166811
11   Sarah Wellington   Wellington     Sarah   40         1588160408
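The **Subscription Date** values in the output above are plain integers. They look like Unix timestamps in seconds, although the source does not say so. Under that assumption, a minimal sketch of converting them to readable dates could look like this (the `sample` frame and the `Subscribed At` column are hypothetical names used only for illustration):

```python
import pandas as pd

# Hypothetical sample reusing three values from the table above.
sample = pd.DataFrame({"Subscription Date": [1588157373, 1588161732, 1588157337]})

# Assumption: the integers are Unix timestamps in seconds (UTC).
sample["Subscribed At"] = pd.to_datetime(sample["Subscription Date"], unit="s")

print(sample)
```

If the assumption holds, all subscriptions in this sample fall in April 2020.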


## Listing the most common names and their counts

* **How many distinct names appear in the user table?**
* **What are the most common names?**

In [43]:
# Bringing all full names into the variable names
names = user_table.FullName
# Printing the number of distinct names
print("---------------------------------------------------")
print("Total names appeared among Users :- ",len(names.value_counts()))
print("----------------------------------------------------")
# Printing most mentioned names.
print(names.value_counts().head(15))
print("----------------------------------------------------")

---------------------------------------------------
Total names appeared among Users :-  246
----------------------------------------------------
Josie Bond              11
Thomas Smith            10
Thomas Meier            10
Sarah Bond              10
Timothy Picard          10
Simon Mueller            8
Sarah Smith              8
Franz Mueller            8
Jean-Luc Thronton        8
Jean-Luc Beierlorzer     8
Ali Mueller              8
Jordi Kirk               8
Lee Smith                7
Agaba Bond               7
Lee Kirk                 7
Name: FullName, dtype: int64
----------------------------------------------------


## Which name is shared by most of the users?

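The most frequent name can be found with the pandas **mode** method, which returns the most frequent value(s) in a column. Because several values can tie, the result is itself a Series, which is why the printed output carries an index and a dtype.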

In [48]:
# Finding most repeated full name among all users
most_frequent_name = user_table.FullName.mode()
print("Most common name among users :",most_frequent_name)


Most common name among users : 0    Josie Bond
dtype: object
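Since **mode** returns a Series, a tie would produce more than one name. The following is a small sketch on made-up data (not the real user table) showing how ties appear and how a single value could be extracted:

```python
import pandas as pd

# Toy data (hypothetical), deliberately containing a tie.
names = pd.Series(["Josie Bond", "Sarah Bond", "Josie Bond", "Sarah Bond", "Ali Pomme"])

# mode() returns every value that shares the highest frequency, sorted.
modes = names.mode()
print(list(modes))

# Take the first entry when one scalar value is enough.
most_frequent_name = modes.iloc[0]
print(most_frequent_name)
```

In the real data the mode has a single entry, so taking the first element would simply give the plain string instead of a Series.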


## How many users share the name "Josie Bond"?
The count is the maximum of the per-name counts returned by **value_counts**.

In [51]:
# Finding maximum mentioned name's count
count = user_table.FullName.value_counts().max()
print("Number of users with the name Josie Bond :",count)

Number of users with the name Josie Bond : 11
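As a cross-check, the name and its count can be read from the same **value_counts** result, and the count can be confirmed by filtering on the name. This is a sketch on toy data; `full_names` is a hypothetical stand-in for `user_table.FullName`:

```python
import pandas as pd

# Toy stand-in for user_table.FullName (hypothetical values).
full_names = pd.Series(["Josie Bond", "Sarah Smith", "Josie Bond", "Hans Roth", "Josie Bond"])

name_counts = full_names.value_counts()

# idxmax() gives the label with the highest count, max() gives the count itself.
top_name = name_counts.idxmax()
top_count = int(name_counts.max())

# Cross-check: count the rows that carry exactly that name.
rows_with_top_name = int((full_names == top_name).sum())

print(top_name, top_count, rows_with_top_name)
```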


# Task 2 :- 

## List the five people with the most posts and reactions combined

#### Steps
* Add an id column to the user table.
* Join the posts table and the reactions table with the user table.
* Count the posts made by each user.
* Count the reactions made by each user.
* Combine the post counts and reaction counts.
* Calculate the total count of posts and reactions for each user from the combined table.
* Sort the result in descending order.
* List the users with the most posts and reactions.
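The counting and ranking steps above can also be sketched compactly on toy data. This is an alternative formulation, not the notebook's own code: it counts rows per user with `groupby(...).size()` and adds the two count Series with `fill_value=0`, so users missing from either table still get a total. The `posts` and `reactions` frames are hypothetical.

```python
import pandas as pd

# Toy tables (hypothetical) with the same column names as the datasets.
posts = pd.DataFrame({"User": [1, 1, 2, 3],
                      "Post Type": ["Image", "Text", "Gif", "Text"]})
reactions = pd.DataFrame({"User": [2, 2, 2, 4, 4],
                          "Reaction Type": ["Like", "Comment", "Like", "Emoticon", "Like"]})

# Count rows per user in each table.
post_count = posts.groupby("User").size()
reaction_count = reactions.groupby("User").size()

# Add the two counts; a user missing from one table contributes 0 there.
total = post_count.add(reaction_count, fill_value=0).astype(int)

# Sort from high to low and keep the top five.
top = total.sort_values(ascending=False).head(5)
print(top)
```

Here user 4 has reactions but no posts and still appears in the totals.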

In [55]:
# Create 'User' column to store the id. Ids are generated with numpy and start at 1
user_id = np.arange(1, len(user_table) + 1)
if 'User' not in user_table.columns:
  user_table.insert(loc=0,column='User',value=user_id)

### User table after adding the id

In [56]:
print(user_table.head(10))

   User           FullName      Surname      Name  Age  Subscription Date
0     1        Sarah Smith        Smith     Sarah   30         1588157373
1     2    Francine Picard       Picard  Francine   32         1588161732
2     3          Hans Roth         Roth      Hans   40         1588157337
3     4          Ali Pomme        Pomme       Ali   28         1588165636
4     5     Jordi Di Lillo     Di Lillo     Jordi   42         1588156042
5     6          Anna Roth         Roth      Anna   26         1588162689
6     7         Jordi Kirk         Kirk     Jordi   56         1588153009
7     8  Josie Beierlorzer  Beierlorzer     Josie   20         1588166376
8     9      Robert Picard       Picard    Robert   39         1588158173
9    10     Jean-Luc Meier        Meier  Jean-Luc   37         1588156009


### Joining posts table and reactions_table with users table to measure the number of posts and reactions

In [66]:
#joining user_table and posts_table as user_posts
user_posts = user_table.merge(posts_table,on='User')
#joining user_table and reactions_table as user_reactions
user_reactions = user_table.merge(reactions_table,on='User')

### Grouping each user in 'user_posts' to find total number of posts made by user

In [72]:
#Grouping user post table based on User(Id) to count posts by user
post_count = user_posts.groupby('User')['Post Type'].agg(['count'])
post_count.reset_index(inplace=True)
#Rename count with PostCount
post_count.rename(columns = {'count':'PostCount'},inplace=True)

### Grouping by user in 'user_reactions' to find total number of reactions made by user

In [75]:
#Grouping user reactions table based on User(Id) to count reactions by user
reaction_count = user_reactions.groupby('User')['Reaction Type'].agg(['count'])
reaction_count.reset_index(inplace=True)
#Rename count with ReactionCount
reaction_count.rename(columns = {'count':'ReactionCount'},inplace=True)

### Combine both reaction count and post count

In [82]:
# Merging both posts and reactions count
combined_count =  pd.merge(post_count,reaction_count,how='left',on='User')
# Replacing empty values with 0
combined_count = combined_count.fillna(0)
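# Note on the merge type

A **left** merge starting from the post counts keeps only users who have at least one post, so a user with reactions but no posts would not be counted. Whether such users exist in this dataset is not shown here. The sketch below uses made-up counts to illustrate the difference between `how='left'` and `how='outer'`:

```python
import pandas as pd

# Toy counts (hypothetical): user 3 has reactions but no posts.
post_count = pd.DataFrame({"User": [1, 2], "PostCount": [3, 1]})
reaction_count = pd.DataFrame({"User": [2, 3], "ReactionCount": [5, 9]})

# Left merge: only users present in post_count survive.
left = pd.merge(post_count, reaction_count, how="left", on="User").fillna(0)

# Outer merge: users from either table survive; missing counts become 0.
outer = pd.merge(post_count, reaction_count, how="outer", on="User").fillna(0)

print(sorted(left["User"].tolist()))
print(sorted(outer["User"].tolist()))
```

With the outer merge, user 3 is kept with a post count of 0.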


### Calculate the sum of post counts and reaction counts for each user

In [88]:
combined_count['TotalCount'] = combined_count['PostCount']+combined_count['ReactionCount']

## Sort from high to low to find top five users with most posts and reactions combined

In [94]:
# Sorting based on TotalCount
combined_count = combined_count.sort_values(by = 'TotalCount',ascending=False)
# Keep only the top 5 rows
top_counted_users = combined_count.head(5)

In [95]:
print(top_counted_users)

     User  PostCount  ReactionCount  TotalCount
641   642         24          217.0       241.0
663   664         24          130.0       154.0
66     67         21          130.0       151.0
677   678         20          130.0       150.0
652   653          9          139.0       148.0


### Fetch details of top 5 users with most reactions and posts combined.

In [106]:
top_counted_users = user_table.merge(top_counted_users[['User','TotalCount']],on='User')
top_counted_users = top_counted_users.sort_values(by = 'TotalCount',ascending=False)

## Which users have the most posts and reactions?

In [113]:
print(top_counted_users)

   User      FullName  Surname     Name  Age  Subscription Date  TotalCount
1   642   Ali Mueller  Mueller      Ali    5         1588145188       241.0
3   664    Zoe Picard   Picard      Zoe   46         1588160876       154.0
0    67  Agaba Gwahsi   Gwahsi    Agaba   51         1588155646       151.0
4   678  Andreas Kirk     Kirk  Andreas   28         1588155836       150.0
2   653     Alok Kirk     Kirk     Alok   18         1588158745       148.0


# Task 3
## Create a plot of the friendship graph for all users named "Jean-Luc Picard" (up to 2nd degree)

### Steps
* Fetch all users named "Jean-Luc Picard".
* Collect the 1st and 2nd degree friends of the users named "Jean-Luc Picard".
* Draw a network diagram of the friendships between them.
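The second step above could be sketched as follows on a toy friends table. This is only an illustration: the helper names `neighbours` and `within_two_degrees` are hypothetical, and the sketch assumes friendships are symmetric (an edge counts in both directions), which the source does not state.

```python
import pandas as pd

# Toy friends table (hypothetical) with the dataset's column names.
friends = pd.DataFrame({"Friend 1": [1, 2, 3, 5], "Friend 2": [2, 3, 4, 6]})

def neighbours(frame, users):
    """Return ids directly connected to any id in `users` (edges treated as undirected)."""
    forward = frame.loc[frame["Friend 1"].isin(users), "Friend 2"]
    backward = frame.loc[frame["Friend 2"].isin(users), "Friend 1"]
    return set(forward) | set(backward)

def within_two_degrees(frame, start_users):
    """Return (1st degree friends, 2nd degree friends) of the start users."""
    start = set(start_users)
    first = neighbours(frame, start) - start
    second = neighbours(frame, first) - first - start
    return first, second

first, second = within_two_degrees(friends, [1])

# Keep only the edges whose endpoints are both inside the 2-hop neighbourhood.
nodes = {1} | first | second
edges = friends[friends["Friend 1"].isin(nodes) & friends["Friend 2"].isin(nodes)]
print(first, second)
print(edges)
```

The resulting `edges` frame is the edge list that a graph-drawing library would need for the network diagram.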

## Finding all users named "Jean-Luc Picard".

In [117]:
# Assign all "Jean-Luc Picard" users to variable
jlp = user_table[user_table['FullName'].str.lower()=='jean-luc picard']

## How many people share the name "Jean-Luc Picard"?

In [119]:
# Finding and printing length of users
number_of_users = len(jlp)
print(number_of_users)


0


## There is no user named "Jean-Luc Picard" in the dataset, so the friendship graph cannot be plotted.