# Preprocessing notebook for Task - 1: Behavior Simulation

**In this notebook, we extract the follower count for each username.**

* To capture the potential impact of an account's popularity on tweet engagement, each Twitter account's follower count was obtained through web scraping with the `requests` and `BeautifulSoup` libraries.

## Steps followed:
1. Step-1: Extract the unique usernames from all the examples.
2. Step-2: Define a function that converts the scraped follower-count text to an integer.
3. Step-3: Define a function that gets the follower count for a given username.
4. Step-4: Apply these functions to the unique usernames to get their follower counts.
5. Step-5: Map the follower counts of the unique usernames back onto the original dataset.
6. Step-6: Save the final dataframe as an Excel file for further use.




# Installing and Importing the necessary libraries

In [None]:
!pip install pandas
!pip install requests
!pip install beautifulsoup4
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Loading the training dataset to a dataframe
The dataset is loaded into a pandas dataframe so that we can preprocess it and extract data smoothly.

 **Replace 'task1_train_dataset.xlsx' with the appropriate file path of the training dataset.**

In [None]:
df=pd.read_excel('task1_train_dataset.xlsx')

# Extracting the Unique Usernames

In this cell, we extract the unique usernames from all the examples.
* We create a new DataFrame: `df_unique_usernames` which contains all the unique usernames.
* We use the `.unique()` method to get all the unique usernames.

In [None]:
df_unique_usernames = pd.DataFrame({'username': df['username'].unique()})

# Function to convert the number of followers text to an integer

In this code cell, we define a function that converts the follower-count text to an integer.

**Input:**
1. followers_text: The follower-count text extracted by web scraping, type: `str`

**Returns:**
Follower count, type: `int`

For example:
* 1.8k = 1800
* 1.8m = 1800000

**Code Explanation:**
* We lowercase the text, then use the `.replace()` method to replace 'k' or 'm' with an empty string ('').
* We convert the remaining text to `float` and multiply it by 1000 for 'k' or 1000000 for 'm'.
* We convert the final count to `int` and return it.



In [None]:
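def convert_followers_to_int(followers_text):
    # Lowercase first so that 'K'/'M' and 'k'/'m' are handled the same way
    followers_text = followers_text.lower()
    if 'k' in followers_text:
        # Thousands, e.g. '1.8k' -> 1800
        return int(float(followers_text.replace('k', '')) * 1000)
    elif 'm' in followers_text:
        # Millions, e.g. '1.8m' -> 1800000
        return int(float(followers_text.replace('m', '')) * 1000000)
    else:
        # No suffix: the text is already a plain integer
        return int(followers_text)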
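
The conversion above assumes the scraped text is a bare number with an optional 'k' or 'm' suffix. As a minimal sketch of a more tolerant variant, the helper below also accepts thousands separators, surrounding whitespace, and a 'b' suffix, and returns `None` instead of raising on text it cannot parse. The name `parse_follower_text` is hypothetical, and whether the site ever emits commas or a 'b' suffix is an assumption rather than something the notebook confirms. It rounds rather than truncates, which guards against floating-point products that land just below a whole number.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Assumed suffixes: k = thousand, m = million, b = billion.
_SUFFIX_MULTIPLIERS = {'k': 1_000, 'm': 1_000_000, 'b': 1_000_000_000}


def parse_follower_text(followers_text):
    """Convert text such as '1.8K' or '12,345' to an int, or None if unparseable."""
    # Normalise case and drop whitespace and thousands separators
    cleaned = followers_text.strip().lower().replace(',', '')
    # A number with an optional decimal part, then an optional suffix
    match = re.fullmatch(r'(\d+(?:\.\d+)?)\s*([kmb]?)', cleaned)
    if match is None:
        return None
    number, suffix = match.groups()
    multiplier = _SUFFIX_MULTIPLIERS.get(suffix, 1)
    # Round instead of truncating to avoid off-by-one results from float error
    return int(round(float(number) * multiplier))
```

Returning `None` for unparseable text matches how the scraping function signals a missing value, so one malformed page would not stop the whole run.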

# Function to extract follower count
In this function, we extract the follower count for a given username.

**Inputs:**
1. username: The username whose follower count is to be fetched, type: `str`

**Returns:**
Follower count, type: `int`, or `None` if no follower count is found on the page.

**Code Explanation:**
* We build the URL by inserting the given username into it.
* We send a GET request using the `.get()` method of the `requests` library.
* We create a `BeautifulSoup` object from the response to the request.
* We use BeautifulSoup methods like `.find()` and `.find_all()` to locate the HTML elements, and then use the `.strip()` and `.replace()` methods to extract the follower count text from them.
* We get the follower count by using the function we created in the last section, `convert_followers_to_int()`.

**Website used for web scraping Twitter follower counts: https://hypeauditor.com/**


In [None]:
def extract_followers_count(username):
    # Build the profile URL for this username and download the page
    url = f'https://hypeauditor.com/twitter/{username}/'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Container that holds all the account metrics
    metrics_div = soup.find('div', {'class': 'metrics flex'})

    if metrics_div:
        metric_divs = metrics_div.find_all('div', {'class': 'metric'})

        # Look for the metric whose title is exactly 'Followers'
        for div in metric_divs:
            metric_title = div.find('div', {'class': 'metric-title'})
            if metric_title and metric_title.text.strip() == 'Followers':
                # Remove the title so that only the count text remains
                followers_count = div.text.replace('Followers', '').strip()
                return convert_followers_to_int(followers_count)

    # No metrics container or no 'Followers' metric on the page
    return None
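
The function above makes one network request per username, and either a connection error from `requests.get()` or a `ValueError` from the conversion step would propagate to the caller. As a rough sketch of one way to make a long scraping run more forgiving, the wrapper below retries a lookup a few times and falls back to `None`. The names `fetch_with_retries` and `flaky_lookup`, the retry count, and the pause length are all hypothetical and not part of the original notebook.

```python
import time


def fetch_with_retries(fetch, username, retries=3, pause_seconds=1.0):
    """Call fetch(username), retrying on any exception; return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(username)
        except Exception as error:  # deliberately broad: network and parsing errors alike
            print(f'attempt {attempt} for {username!r} failed: {error}')
            if attempt < retries:
                # Wait briefly before the next attempt
                time.sleep(pause_seconds)
    return None


# Demonstration with a stand-in lookup that fails twice before succeeding.
attempts_seen = []


def flaky_lookup(username):
    attempts_seen.append(username)
    if len(attempts_seen) < 3:
        raise ConnectionError('temporary failure')
    return 1800


result = fetch_with_retries(flaky_lookup, 'example_user', pause_seconds=0)
```

In the notebook, the same wrapper could be called as `fetch_with_retries(extract_followers_count, username)`, so that a single failed request leaves a missing value rather than ending the run.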

# Using the above functions to fetch the follower count of each unique username

In this code block, we use the pandas `.apply()` method to add a new column to the unique usernames dataframe that holds each username's follower count. Null usernames are skipped and receive `None`.


In [None]:
df_unique_usernames['followers'] = df_unique_usernames['username'].apply(lambda x: extract_followers_count(x) if pd.notnull(x) else None)

# A new column "followers" is added to the original dataframe
- A dictionary is created which maps every unique username to the follower count.
- This mapping is used to create a new "followers" column in the original dataframe.

In [None]:
username_followers_mapping = df_unique_usernames.set_index('username')['followers'].to_dict()
df['followers'] = df['username'].map(username_followers_mapping)
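
To illustrate how the mapping step behaves, here is a small sketch on made-up data. The names `toy_df` and `toy_mapping` and every value in them are hypothetical. It shows that a username with no scraped count, or one absent from the dictionary altogether, ends up as a missing value in the new column, and how the share of rows that did receive a count could be checked.

```python
import pandas as pd

# Toy data standing in for the real dataset (all values are made up).
toy_df = pd.DataFrame({'username': ['alice', 'bob', 'alice', 'carol']})

# 'bob' was scraped without success; 'carol' is absent from the mapping.
toy_mapping = {'alice': 1800, 'bob': None}

# Same pattern as in the notebook: look up each username in the dictionary
toy_df['followers'] = toy_df['username'].map(toy_mapping)

# Count the rows that did not receive a follower count
missing_rows = int(toy_df['followers'].isna().sum())
coverage = 1 - missing_rows / len(toy_df)
```

A check like this on the real dataframe would indicate how many rows still lack a follower count before the file is saved.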

# Converting the dataframe back to an Excel sheet
The dataframe is then saved as an Excel file for further use.

In [None]:
# index=False keeps the dataframe's row index from being written as an extra column
df.to_excel('PS_8_dataset_with_followers.xlsx', index=False)