# Introduction

In this section, we will use the <a href="../Data/users.csv">users.csv</a> and <a href="../Data/posts.csv">posts.csv</a> datasets while trying to answer the following question:
<b>What are the posts we have to include in the feed of each user?</b>

## Process

This task involves taking the interests of each user and recommending posts whose category matches one of those interests. The steps for this process are:

<ul>
<li>Import the required libraries (Pandas and ast) and create the <code>users</code> and <code>posts</code> DataFrames</li>
<li>Clean the DataFrames</li>
<li>Generate the <code>suggested_posts</code> DataFrame</li>
</ul>

# Importing and Initializing DataFrames

In [1]:
import ast
import pandas as pd


users = pd.read_csv("../Data/users.csv")
posts = pd.read_csv("../Data/posts.csv")

We will also take a quick look at each DataFrame.

In [2]:
users.head()

Unnamed: 0,username,location,interests,how_did_you_hear_about_us
0,sray,New York,"['movies', 'music']",tv ad
1,margaret96,Los Angeles,"['tech', 'movies', 'gaming']",google search
2,descobar,New York,"['movies', 'gaming']",tv ad
3,oconnorcaroline,Phoenix,['sports'],online ad
4,katherinebrown,Houston,['gaming'],app store


In [3]:
posts.head()

Unnamed: 0,post_title,category,number_of_likes,number_of_comments,date,time
0,Future-proofed regional frame,,297,1,2024-01-01,00:00:00
1,Open-architected well-modulated budgetary mana...,sports,257,570,2024-01-01,06:00:00
2,Centralized next generation toolset,music,251,47,2024-01-01,12:00:00
3,Fully-configurable homogeneous architecture,gaming,17,98,2024-01-01,18:00:00
4,Centralized asynchronous application,,62,2,2024-01-02,00:00:00


# Data cleaning

Before cleaning the data, we need to check whether there are any null values.

In [4]:
users.isna().value_counts()

username  location  interests  how_did_you_hear_about_us
False     False     False      False                        47
                               True                          3
Name: count, dtype: int64

In [5]:
posts.isna().value_counts()

post_title  category  number_of_likes  number_of_comments  date   time 
False       False     False            False               False  False    245
            True      False            False               False  False      5
Name: count, dtype: int64
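
A per-column count can be easier to read than the row-pattern tables above. The sketch below shows the idea on a small made-up DataFrame; the `toy_posts` name and its values are illustrative assumptions, not rows from posts.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for posts.csv
toy_posts = pd.DataFrame({
    "post_title": ["a", "b", "c", "d"],
    "category": [None, "sports", "music", None],
    "number_of_likes": [297, 257, 251, 62],
})

# isna() marks each missing cell as True; sum() counts the True values per column
null_counts = toy_posts.isna().sum()
print(null_counts)
```

Applied to the real data, `users.isna().sum()` and `posts.isna().sum()` would give one null count per column.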

In the ```users``` DataFrame, the only null values are 3 in the ```how_did_you_hear_about_us``` column, which we will not use in this section. The ```category``` column of the ```posts``` DataFrame, however, contains 5 null values. Since a post without a category cannot be matched to any interest, we will drop those rows, taking care to drop only rows whose ```category``` is null rather than rows with a null in any column.

In [6]:
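posts = posts.dropna(subset=["category"])  # drop only rows whose category is null
posts = posts.reset_index(drop=True)  # renumber rows 0..n-1 so later label-based lookups stay contiguous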
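
To illustrate what `subset` does, here is a minimal sketch on made-up data. The `toy` DataFrame and its `note` column are hypothetical and exist only to show that nulls in other columns are left alone.

```python
import pandas as pd

# Hypothetical toy data: one null in "category", one null in "note"
toy = pd.DataFrame({
    "post_title": ["a", "b", "c"],
    "category": [None, "music", "gaming"],
    "note": ["x", None, "z"],
})

# Only the row with a null category is dropped; the null in "note" survives
cleaned = toy.dropna(subset=["category"]).reset_index(drop=True)
print(cleaned)
```

Resetting the index matters in this notebook because the `suggest()` function defined later looks up rows by the labels 0 to `len(posts) - 1`, which only works if no labels are missing.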

Now, let us take a look at a sample of the ```users``` DataFrame.

In [7]:
users.sample()

Unnamed: 0,username,location,interests,how_did_you_hear_about_us
49,stanleynicole,Chicago,"['music', 'movies']",social media


We notice that the ```interests``` column is stored as a string rather than a list. Since we will need to work with each of its elements later, we first convert it using the ```ast.literal_eval()``` function, which parses a list-like string into a Python list.

In [8]:
users['interests'] = users['interests'].apply(ast.literal_eval)

Looking at another sample, we find that the ```interests``` column now contains Python lists.

In [9]:
users.sample()

Unnamed: 0,username,location,interests,how_did_you_hear_about_us
22,bmejia,Houston,"[tech, movies, fashion]",online ad
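
The difference between the string and the list matters for membership tests. The following small sketch uses a single value in the same format as the ```interests``` column; the variable names `raw` and `parsed` are illustrative.

```python
import ast

raw = "['tech', 'movies', 'gaming']"  # the string form, as read from the CSV
parsed = ast.literal_eval(raw)        # safely parses the literal into a list

print(type(raw).__name__, type(parsed).__name__)  # str list

# On a string, "in" is a substring test; on a list, it compares whole elements
print("mov" in raw)     # True
print("mov" in parsed)  # False
print("movies" in parsed)  # True
```

Without the conversion, a check such as `category in interests` would match partial words, which is not the intended behavior.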


# Generate the ```suggested_posts``` DataFrame

We start by defining the ```suggest()``` function.

In [10]:
def suggest(index, users_df, posts_df):
    # Takes the index of a user, the users DataFrame, and the posts DataFrame
    result = []
    interests = users_df.loc[index, "interests"]
    for j in range(len(posts_df)):  # j runs over the row labels 0..len-1 (requires a reset index)
        if posts_df.loc[j, "category"] in interests:
            # Append the title of each post whose category is in the user's interest list
            result.append(posts_df.loc[j, "post_title"])
    return result
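
As an alternative to the explicit loop, the same matching can be sketched with a boolean mask. This is not the function used in the rest of the notebook; `suggest_vectorized`, `toy_users`, and `toy_posts` are hypothetical names introduced for illustration.

```python
import pandas as pd

def suggest_vectorized(index, users_df, posts_df):
    """Return titles of posts whose category is in the user's interests."""
    interests = users_df.loc[index, "interests"]
    # isin() builds a True/False mask with one entry per post
    mask = posts_df["category"].isin(interests)
    return posts_df.loc[mask, "post_title"].tolist()

# Hypothetical toy data in the same shape as the cleaned DataFrames
toy_users = pd.DataFrame({
    "username": ["u1", "u2"],
    "interests": [["music", "gaming"], ["sports"]],
})
toy_posts = pd.DataFrame({
    "post_title": ["p1", "p2", "p3"],
    "category": ["sports", "music", "gaming"],
})

print(suggest_vectorized(0, toy_users, toy_posts))  # ['p2', 'p3']
```

This version does not depend on the posts index being contiguous, and it avoids one `.loc` lookup per row, which usually helps on larger DataFrames.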


Now that the function is defined, we create a copy of the original ```users``` DataFrame and generate its ```suggested_posts``` column.

In [11]:
users_and_posts = users.copy()  # take a copy of the original DataFrame
users_and_posts["suggested_posts"] = [[] for _ in range(users_and_posts.shape[0])]  # initialize a new column named suggested_posts
for i in range(users_and_posts.shape[0]):
    users_and_posts.at[i, "suggested_posts"] = suggest(i, users, posts)  # store the list returned by suggest() for each index

users_and_posts = users_and_posts.loc[:, ["username", "suggested_posts"]]  # keep only the username and suggested_posts columns

# Final result

We can now take a look at our new DataFrame.

In [12]:
users_and_posts.head()

Unnamed: 0,username,suggested_posts
0,sray,"[Centralized next generation toolset, Grass-ro..."
1,margaret96,"[Fully-configurable homogeneous architecture, ..."
2,descobar,"[Fully-configurable homogeneous architecture, ..."
3,oconnorcaroline,[Open-architected well-modulated budgetary man...
4,katherinebrown,"[Fully-configurable homogeneous architecture, ..."
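
A similar table can also be built without calling a function once per user, by expanding each interest into its own row and joining on the category. The sketch below swaps in `explode`, `merge`, and `groupby` in place of the notebook's loop; the toy DataFrames and the `pairs`, `matched`, and `suggested` names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the cleaned DataFrames
toy_users = pd.DataFrame({
    "username": ["u1", "u2"],
    "interests": [["music", "gaming"], ["sports"]],
})
toy_posts = pd.DataFrame({
    "post_title": ["p1", "p2", "p3"],
    "category": ["sports", "music", "gaming"],
})

# One row per (user, interest) pair
pairs = toy_users.explode("interests")

# Inner join: keep only pairs whose interest equals a post's category
matched = pairs.merge(toy_posts, left_on="interests", right_on="category")

# Collect the matching titles back into one list per user
suggested = (
    matched.groupby("username")["post_title"]
    .agg(list)
    .reset_index(name="suggested_posts")
)
print(suggested)
```

Two differences from the loop are worth keeping in mind: users with no matching posts disappear from the result because of the inner join, and the titles within each list may come out in a different order than the loop produces.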


# Try it yourself

Run the cell below (press ```Ctrl``` + ```Enter```) to enter a username and see all of its suggested posts. Please use a username that already exists in the DataFrame, such as <b>descobar</b>.

<h3><b>Please uncomment the cell by removing the #'s before running it</b></h3>

In [13]:
#username = input("enter name: ")
#users_and_posts[users_and_posts.username == username].suggested_posts.tolist()
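
If the username might not exist, the lookup can be wrapped in a small helper that returns an empty list instead of an empty result. This is a minimal sketch: `lookup_suggestions` is a hypothetical helper and `toy` is made-up data, not part of the notebook above.

```python
import pandas as pd

def lookup_suggestions(username, df):
    """Return the suggested posts for `username`, or [] if the user is unknown."""
    match = df.loc[df["username"] == username, "suggested_posts"]
    # match is a Series with zero or one rows; take the first list if present
    return match.iloc[0] if not match.empty else []

# Hypothetical toy data in the same shape as users_and_posts
toy = pd.DataFrame({
    "username": ["descobar"],
    "suggested_posts": [["p1", "p2"]],
})

print(lookup_suggestions("descobar", toy))  # ['p1', 'p2']
print(lookup_suggestions("nobody", toy))    # []
```

Unlike the `.tolist()` call in the cell above, which wraps the result in an outer list, this helper returns the user's list of titles directly.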

# Conclusion

In this section, we retrieved the posts to suggest to each user based on that user's interests. Along the way, we explored how to clean and prepare data and how to write a Python function that finds matches between two DataFrames.

# Next up

In the next section, we will try to determine which ads to display in each location based on the ad category and how common it is in that location. To check it out, please click <a href="2-Ads_per_location.ipynb">here</a>.