Task 1 – Import required libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random
import seaborn as sns

Task 2 – Generate random data for the social media data

In [None]:
# number of rows to generate (one post per day)
periods_long = 500

categories = ["Food", "Travel", "Fashion", "Fitness", "Music", "Culture", "Family", "Health"]

# one row per day starting 2023-01-01, each with a random category and like count
data = {
    "Date": pd.date_range(start='2023-01-01', periods=periods_long),
    "Category": [random.choice(categories) for _ in range(periods_long)],
    "Likes": np.random.randint(0, 10000, size=periods_long),  # integers in [0, 10000)
}
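
The cell above draws new values on every run, so the counts and means later in the notebook change between executions. A minimal sketch of a reproducible variant follows. It assumes a fixed seed of 42 (an arbitrary choice, not from the original notebook), a hypothetical helper name `make_social_data`, and NumPy's `default_rng` generator in place of the `random` module:

```python
import numpy as np
import pandas as pd

SEED = 42  # assumption: any fixed integer gives repeatable output
periods_long = 500
categories = ["Food", "Travel", "Fashion", "Fitness", "Music", "Culture", "Family", "Health"]

def make_social_data(n, seed):
    """Hypothetical helper: build the same columns as above from one seeded generator."""
    rng = np.random.default_rng(seed)
    return {
        "Date": pd.date_range(start="2023-01-01", periods=n),
        "Category": rng.choice(categories, size=n),
        "Likes": rng.integers(0, 10000, size=n),  # upper bound is exclusive
    }

seeded = make_social_data(periods_long, SEED)
```

Calling `make_social_data` twice with the same seed returns identical categories and like counts, which makes later cells comparable across runs.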

Task 3 – Load the data into a Pandas DataFrame and Explore the data

In [None]:
data = pd.DataFrame(data)
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data["Category"].value_counts()

Task 4 – Clean the data

In [None]:
# check for missing values
data.isnull().sum()

In [None]:
# check for duplicates
data.duplicated().sum()

In [None]:
data.info()

In [None]:
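# enforce the expected dtypes (Date is already datetime and Likes already integer here,
# so these conversions act as a safeguard rather than a change)
data['Date'] = pd.to_datetime(data['Date'])
data['Likes'] = data['Likes'].astype(int)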

In [None]:
data.info()

In [None]:
data.head()
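
The checks in this task only report counts. On this generated data both counts are zero, since no values are missing and every row has a distinct date. On real data the same steps would need to act on what they find. A sketch of how they could be bundled, assuming a hypothetical helper name `clean_posts` and a small hand-made frame with illustrative values:

```python
import pandas as pd

def clean_posts(df):
    """Hypothetical helper: drop missing rows and exact duplicates, then fix dtypes."""
    out = df.dropna().drop_duplicates().copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out["Likes"] = out["Likes"].astype(int)
    return out.reset_index(drop=True)

# Illustrative input: one duplicated row and one row with a missing category
raw = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-03"],
    "Category": ["Food", "Food", "Travel", None],
    "Likes": [10.0, 10.0, 25.0, 7.0],
})
cleaned = clean_posts(raw)
print(cleaned)
```

Dropping rows is only one option. Depending on the dataset, imputing missing values may be preferable to discarding them.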

Task 5 – Visualize and Analyze the data

In [None]:
# histogram of likes
plt.figure(figsize=(12, 6))
sns.histplot(data['Likes'], kde=True)
plt.title('Distribution of Likes')
plt.xlabel('Likes')
plt.ylabel('Frequency')
plt.show()

In [None]:
# boxplot of likes and categories
plt.figure(figsize=(12, 6))
sns.boxplot(x='Category', y='Likes', data=data)
plt.title('Likes by Category')
plt.xlabel('Category')
plt.ylabel('Likes')
plt.show()

In [None]:
data['Likes'].mean()

In [None]:
# mean likes of each category
data.groupby('Category')['Likes'].mean()
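
A mean on its own hides how many posts each category has and how widely the values spread. The sketch below extends the same `groupby` to several statistics at once. It rebuilds a stand-in frame named `demo` with a fixed seed (both are assumptions made so the block runs on its own) rather than reusing `data`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumption: seed fixed only for this sketch
cats = ["Food", "Travel", "Fashion", "Fitness", "Music", "Culture", "Family", "Health"]
demo = pd.DataFrame({
    "Category": rng.choice(cats, size=500),
    "Likes": rng.integers(0, 10000, size=500),
})

# count, mean, median and standard deviation of Likes per category,
# ordered from highest to lowest mean
summary = (
    demo.groupby("Category")["Likes"]
    .agg(["count", "mean", "median", "std"])
    .sort_values("mean", ascending=False)
)
print(summary.round(1))
```

Reading the `std` column next to `mean` helps judge whether a gap between two categories is large relative to the spread within them.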

Conclusion

In conclusion, this Jupyter notebook builds a small data analysis pipeline for a hypothetical social media dataset. It imports the required libraries, generates 500 rows of random data (one per day from 2023-01-01, each with a category and a like count), and loads them into a pandas DataFrame for exploration and cleaning. The notebook checks for missing values and duplicates and confirms that each column has the expected data type.

The data is then visualized with a histogram and a boxplot, showing the distribution of 'Likes' overall and within each 'Category'. The notebook also calculates the mean 'Likes' for the entire dataset and for each category. Because 'Likes' are generated at random, independently of 'Category', any differences between categories reflect sampling noise rather than a real effect.

This notebook provides a starting point for further analysis or machine learning tasks on real data. Future work could include more thorough data cleaning and preprocessing, more in-depth exploratory data analysis, and the application of machine learning algorithms to predict 'Likes' based on 'Category' or other features.
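
As a first step toward the prediction idea, the sketch below compares two simple baselines: predicting every post's likes with the overall training mean, and predicting them with the training mean of the post's category. This is not a full machine learning model, only a reference point that any model would need to beat. The stand-in frame `demo`, the fixed seed, and the 80/20 split are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumption: seed fixed only for this sketch
cats = ["Food", "Travel", "Fashion", "Fitness", "Music", "Culture", "Family", "Health"]
demo = pd.DataFrame({
    "Category": rng.choice(cats, size=500),
    "Likes": rng.integers(0, 10000, size=500),
})

# Hold out the last 20% of rows for evaluation
split = int(len(demo) * 0.8)
train, test = demo.iloc[:split], demo.iloc[split:]

global_mean = train["Likes"].mean()
category_means = train.groupby("Category")["Likes"].mean()

# Predict each held-out row from its category's training mean,
# falling back to the global mean for categories unseen in training
pred_by_category = test["Category"].map(category_means).fillna(global_mean)

# Mean absolute error of each baseline on the held-out rows
mae_global = (test["Likes"] - global_mean).abs().mean()
mae_category = (test["Likes"] - pred_by_category).abs().mean()
print(f"MAE, global mean:   {mae_global:.1f}")
print(f"MAE, category mean: {mae_category:.1f}")
```

With likes generated independently of category, the category baseline should not be expected to improve much, if at all, on the global mean. A clear gap between the two would be more plausible on real data where category carries signal.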