
In [2]:
!pip install faker

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faker
  Downloading Faker-18.4.0-py3-none-any.whl (1.7 MB)
Installing collected packages: faker
Successfully installed faker-18.4.0


In [3]:
## Users DF

import numpy as np
import pandas as pd
from faker import Faker

# Initialize the Faker library and seed both random generators so reruns produce the same data
fake = Faker()
Faker.seed(42)
np.random.seed(42)

# Define the number of unique users
num_users = 100000

# Generate user_id
user_id = list(range(1, num_users + 1))

# Generate age
median_age = 25
ages = np.random.normal(loc=median_age, scale=5, size=num_users).astype(int)

# Generate gender
gender_ratio = [0.48, 0.5, 0.01, 0.01]  # Male, Female, Non-Binary, Prefer not to say
gender = np.random.choice(["Male", "Female", "Non-Binary", "Prefer not to say"], size=num_users, p=gender_ratio)

# Generate country
countries = ["United States", "United Kingdom", "France", "Germany", "Spain", "Italy", "Netherlands", "Belgium", "Sweden", "Norway", "Denmark", "Finland", "Switzerland", "Ireland", "Austria", "Portugal", "Greece"]
country = np.random.choice(countries, size=num_users)

# Generate signup_date
signup_dates = [fake.date_between(start_date='-3y', end_date='today') for _ in range(num_users)]

# Generate subscription_type
subscription_ratio = [0.6, 0.3, 0.1]  # Free, Basic, Premium
subscription_type = np.random.choice(["Free", "Basic", "Premium"], size=num_users, p=subscription_ratio)

# Create the Users DataFrame
users_df = pd.DataFrame({"user_id": user_id, "age": ages, "gender": gender, "country": country, "signup_date": signup_dates, "subscription_type": subscription_type})

# Save the DataFrame to a CSV file
users_df.to_csv("user_data.csv", index=False)


In [14]:
# Content df

# Define the number of content items
num_content = 5000

# Generate content_id
content_id = list(range(1, num_content + 1))

# Generate content_type
content_type_ratio = [0.6, 0.35, 0.05]  # Movie, TV Show, Live Event
content_type = np.random.choice(["Movie", "TV Show", "Live Event"], size=num_content, p=content_type_ratio)

# Generate genre
genres = ["Action", "Comedy", "Drama", "Documentary", "Thriller", "Horror", "Sci-Fi", "Romance", "Animation", "Crime", "Family", "Adventure", "Fantasy", "Mystery", "Biography", "History", "Sport", "Music", "War", "Western"]
genre = np.random.choice(genres, size=num_content)

# Generate release_year
release_years = np.random.randint(low=1990, high=2023, size=num_content)

# Generate duration_minutes
duration_minutes = np.random.randint(low=30, high=240, size=num_content)

# Generate average_user_rating
average_user_rating = np.random.uniform(low=1, high=5, size=num_content).round(1)

# Create the Content DataFrame
content_df = pd.DataFrame({"content_id": content_id, "content_type": content_type, "genre": genre, "release_year": release_years, "duration_minutes": duration_minutes, "average_user_rating": average_user_rating})

# Save the DataFrame to a CSV file
content_df.to_csv("content_data.csv", index=False)


In [16]:
# User-content Interactions

# Define the number of interactions
num_interactions = 50000

# Generate user_id for interactions
interaction_user_id = np.random.choice(user_id, size=num_interactions)

# Generate content_id for interactions
interaction_content_id = np.random.choice(content_id, size=num_interactions)

# Create a temporary Interactions DataFrame
temp_interactions_df = pd.DataFrame({"user_id": interaction_user_id, "content_id": interaction_content_id})

# Merge the Users DataFrame with the temporary Interactions DataFrame on user_id
merged_df = temp_interactions_df.merge(users_df[["user_id", "signup_date"]], on="user_id")

# Generate interaction_timestamp ensuring it's after the user's signup_date
merged_df["interaction_timestamp"] = merged_df["signup_date"].apply(lambda x: fake.date_time_between(start_date=x, end_date='now', tzinfo=None))

# Generate interaction_type
interaction_types = ["Played", "Liked", "Disliked", "Added to Playlist", "Shared"]
interaction_type = np.random.choice(interaction_types, size=num_interactions)

# Add interaction_type to the merged DataFrame
merged_df["interaction_type"] = interaction_type

# Create the final User-Content Interactions DataFrame
interactions_df = merged_df[["user_id", "content_id", "interaction_type", "interaction_timestamp"]]

# Save the DataFrame to a CSV file
interactions_df.to_csv("interactions_data.csv", index=False)


In [19]:
# Revenue DF

# Sample 40% of interactions as revenue-generating events
# (copy so that new columns can be added without touching interactions_df)
revenue_interactions_df = interactions_df.sample(frac=0.4, random_state=42).copy()

# Generate revenue_amount using a log-normal distribution
mu, sigma = 0, 1
revenue_amount = np.random.lognormal(mean=mu, sigma=sigma, size=len(revenue_interactions_df))
revenue_amount = np.round(revenue_amount * 10, 2)  # Scale and round the values

# Add revenue_amount to the revenue_interactions_df
revenue_interactions_df["revenue_amount"] = revenue_amount

# Generate revenue_type based on content_type
content_type_revenue_ratio = {
    "Movie": [0.3, 0.6, 0.1],
    "TV Show": [0.4, 0.5, 0.1],
    "Live Event": [0.2, 0.3, 0.5]
}

# Index content_type by content_id once, instead of scanning content_df for every row
content_type_by_id = content_df.set_index("content_id")["content_type"]

def generate_revenue_type(row):
    content_type = content_type_by_id[row["content_id"]]
    return np.random.choice(["Subscription", "Advertisement", "In-app Purchase"], p=content_type_revenue_ratio[content_type])

revenue_interactions_df["revenue_type"] = revenue_interactions_df.apply(generate_revenue_type, axis=1)

# Generate transaction_date based on interaction_timestamp
revenue_interactions_df["transaction_date"] = revenue_interactions_df["interaction_timestamp"].dt.date

# Create the Revenue DataFrame
revenue_df = revenue_interactions_df[["user_id", "content_id", "revenue_amount", "revenue_type", "transaction_date"]]

# Save the DataFrame to a CSV file
revenue_df.to_csv("revenue_data.csv", index=False)


In [22]:
# Define the number of A/B test interactions
num_ab_test_interactions = 10000

# Generate user_id for A/B test interactions
ab_test_user_id = np.random.choice(user_id, size=num_ab_test_interactions)

# Generate content_id for A/B test interactions
ab_test_content_id = np.random.choice(content_id, size=num_ab_test_interactions)

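# Generate group (Control or Test)
# Note: the group is drawn independently for each interaction row, so a user who is
# sampled more than once can appear in both groups.
group = np.random.choice(["Control", "Test"], size=num_ab_test_interactions)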

# Create a temporary A/B Test Interactions DataFrame
temp_ab_test_interactions_df = pd.DataFrame({"user_id": ab_test_user_id, "content_id": ab_test_content_id, "group": group})

# Merge the Users DataFrame with the temporary A/B Test Interactions DataFrame on user_id
merged_ab_test_df = temp_ab_test_interactions_df.merge(users_df[["user_id", "signup_date"]], on="user_id")

# Generate interaction_timestamp ensuring it's after the user's signup_date
merged_ab_test_df["interaction_timestamp"] = merged_ab_test_df["signup_date"].apply(lambda x: fake.date_time_between(start_date=x, end_date='now', tzinfo=None))

# Generate interaction_type
ab_test_interaction_types = ["Played", "Liked", "Disliked"]
ab_test_interaction_type = np.random.choice(ab_test_interaction_types, size=num_ab_test_interactions)

# Add interaction_type to the merged A/B Test DataFrame
merged_ab_test_df["interaction_type"] = ab_test_interaction_type

# Create the final A/B Test Interactions DataFrame
ab_test_interactions_df = merged_ab_test_df[["user_id", "content_id", "group", "interaction_type", "interaction_timestamp"]]

# Save the DataFrame to a CSV file
ab_test_interactions_df.to_csv("ab_test_data.csv", index=False)


# Case Study: Tech Startup "StreamVision"

## Backstory:

StreamVision is a technology startup that has recently developed a cutting-edge streaming platform for movies, TV shows, and live events. The company is growing rapidly and has acquired a substantial user base in a short period. The management team is interested in using data-driven decision-making to optimize the platform's user experience and maximize revenue.

As a data scientist at StreamVision, you have been tasked with analyzing the platform's usage data to identify opportunities for improvement and make data-backed recommendations to the management team.

## Student Tasks:

## Data Wrangling and Analysis (Week 4)

Use Python and Pandas to clean and preprocess the raw usage data.
Perform exploratory data analysis to identify trends and patterns in user behavior.
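
As a starting point for the cleaning step, here is a minimal sketch that assumes the column names produced by the generator above. The helper name `clean_users` and the age bounds are illustrative assumptions, and the small inline frame stands in for `pd.read_csv("user_data.csv")`.

```python
import pandas as pd

def clean_users(users: pd.DataFrame, min_age: int = 13, max_age: int = 100) -> pd.DataFrame:
    """Return a cleaned copy of the users table.

    The age bounds are illustrative assumptions, not platform rules.
    """
    out = users.copy()
    # Parse signup_date; unparseable values become NaT instead of raising.
    out["signup_date"] = pd.to_datetime(out["signup_date"], errors="coerce")
    # Drop repeated user_ids, keeping the first occurrence.
    out = out.drop_duplicates(subset="user_id", keep="first")
    # Keep only rows with a valid date and a plausible age.
    valid = out["signup_date"].notna() & out["age"].between(min_age, max_age)
    return out.loc[valid].reset_index(drop=True)

# Tiny stand-in for pd.read_csv("user_data.csv")
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "age": [25, 31, 31, 7, 40],
    "signup_date": ["2022-01-15", "2021-06-01", "2021-06-01", "2023-02-10", "not a date"],
    "subscription_type": ["Free", "Basic", "Basic", "Free", "Premium"],
})
cleaned = clean_users(raw)
print(cleaned)
```

Returning a copy keeps the raw frame intact, so each cleaning rule can be compared against the original row counts.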

## Statistical Analysis (Week 5)

Apply descriptive statistics to summarize the key features of the usage data.
Conduct hypothesis testing to validate assumptions and answer key questions posed by the management team.
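
One way to approach the hypothesis-testing task without distributional assumptions is a permutation test. The sketch below uses only NumPy. The function name `permutation_test_mean_diff` and the synthetic samples are assumptions for illustration; in practice the two arrays would come from `revenue_amount` split by a grouping column such as `subscription_type`.

```python
import numpy as np

def permutation_test_mean_diff(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference (mean(a) - mean(b)) and the p-value.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Shuffle group labels by permuting the pooled sample.
        shuffled = rng.permutation(pooled)
        diff = shuffled[: a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value

# Synthetic stand-in for revenue_amount in two user segments
rng = np.random.default_rng(42)
premium = rng.lognormal(mean=0.3, sigma=1.0, size=300) * 10
free = rng.lognormal(mean=0.0, sigma=1.0, size=300) * 10
diff, p = permutation_test_mean_diff(premium, free)
print(f"observed difference: {diff:.2f}, p-value: {p:.4f}")
```

A permutation test suits skewed quantities such as revenue, where the normality assumption behind a t-test may be questionable.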

## Experimental Design (Week 6)

Design an A/B test to evaluate the impact of potential changes to the platform (e.g., new recommendation algorithms or user interface adjustments).
Analyze the results of the A/B test and determine if the changes had a statistically significant impact on user engagement or revenue.
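
For the analysis half of this task, a two-proportion z-test is a common first check. The sketch below uses only the standard library, and the counts are hypothetical. With `ab_test_data.csv` (columns `user_id`, `content_id`, `group`, `interaction_type`, `interaction_timestamp`), a "success" could be a row whose `interaction_type` is `Liked`. The test treats rows as independent, which is an assumption: in the generated data the group is drawn per row, so a user can appear in both groups.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: "Liked" rows out of all rows in each group
z, p_value = two_proportion_z_test(successes_a=1650, n_a=5000, successes_b=1750, n_b=5000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

A positive `z` means the second group (here, Test) has the higher rate. Whether that justifies a rollout also depends on the size of the effect, not only on the p-value.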

## Data Visualization (Week 7)

Use Matplotlib and Seaborn to create visualizations that effectively communicate the findings from the data analysis.
Present the visualizations in a way that is easily understood by the management team.
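
A minimal Matplotlib sketch of one such chart follows, assuming the `interaction_type` column from `interactions_data.csv`. The inline frame is a stand-in for the real file, and the output filename is arbitrary.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Tiny stand-in for pd.read_csv("interactions_data.csv")
interactions = pd.DataFrame({
    "interaction_type": ["Played", "Liked", "Played", "Shared", "Disliked", "Played", "Liked"],
})

# Count interactions per type, most frequent first.
counts = interactions["interaction_type"].value_counts().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(counts.index, counts.values, color="steelblue")
ax.set_title("Interactions by type")
ax.set_xlabel("Interaction type")
ax.set_ylabel("Number of interactions")
fig.tight_layout()
fig.savefig("interactions_by_type.png", dpi=150)
```

Sorting the bars and labelling both axes keeps the chart readable for a non-technical audience.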

## Regression Analysis (Week 8)

Build and evaluate linear regression models to predict user engagement and revenue based on different platform features.
Use model selection and regularization techniques to improve the performance of the regression models.
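
To make the regularization idea concrete, here is a sketch of closed-form ridge regression with a held-out evaluation, written with NumPy only. The helper names `fit_ridge` and `r_squared`, the two features, and the synthetic target are assumptions for illustration, not results from the StreamVision data.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is not penalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center features and target so the intercept can be recovered separately.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def r_squared(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in: a target driven by two hypothetical standardized features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.5, size=500)

# Hold out the last 100 rows to evaluate on unseen data.
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]
coef, intercept = fit_ridge(X_train, y_train, alpha=1.0)
test_r2 = r_squared(y_test, X_test @ coef + intercept)
print("coefficients:", coef.round(2), "intercept:", round(float(intercept), 2))
print("test R^2:", round(float(test_r2), 3))
```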

## Questions from Executives:

- What are the main factors driving user engagement and revenue on the platform?
- Are there any specific user segments or content categories that are underperforming or overperforming compared to the rest?
- Can we optimize the recommendation algorithm to increase user engagement and revenue? What changes would you propose?
- Are there any statistically significant differences in user behavior based on demographic factors or user preferences?
- How effective are the proposed platform changes based on the A/B test results? Should we implement the changes for all users?
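
The question about over- and underperforming content categories can be approached with a join and a grouped summary. The sketch below assumes the `content_id`, `genre`, and `revenue_amount` columns described in the tables that follow. The inline frames are stand-ins for the CSV files, and the `mean_vs_overall` column is a hypothetical metric.

```python
import pandas as pd

# Tiny stand-ins for revenue_data.csv and content_data.csv
revenue = pd.DataFrame({
    "content_id": [1, 1, 2, 3, 3, 3],
    "revenue_amount": [10.0, 20.0, 5.0, 30.0, 40.0, 50.0],
})
content = pd.DataFrame({
    "content_id": [1, 2, 3],
    "genre": ["Drama", "Comedy", "Action"],
})

# Attach genre to each revenue row; validate raises if content_id is duplicated in content.
merged = revenue.merge(content, on="content_id", how="left", validate="many_to_one")

by_genre = (
    merged.groupby("genre")["revenue_amount"]
    .agg(total="sum", mean="mean", transactions="count")
    .sort_values("total", ascending=False)
)
# A ratio above 1 means the genre earns more per transaction than the overall average.
by_genre["mean_vs_overall"] = by_genre["mean"] / merged["revenue_amount"].mean()
print(by_genre)
```

The same pattern applies to user segments by merging on `user_id` and grouping by `country` or `subscription_type`.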

## Users Table (user_data.csv):

- `user_id` (integer): Unique identifier for each user.
- `age` (integer): Age of the user.
- `gender` (string): Gender of the user (Male, Female, Non-Binary, Prefer not to say).
- `country` (string): Country of residence of the user.
- `signup_date` (date): Date when the user signed up for the platform.
- `subscription_type` (string): Type of subscription (Free, Basic, Premium).
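
When loading this table, declaring the types up front avoids surprises later. The following is a sketch under the schema above. The inline CSV text stands in for `user_data.csv`, and the ordering Free < Basic < Premium is an assumption about how the tiers rank.

```python
import io
import pandas as pd

# Inline sample in the same layout as user_data.csv
sample_csv = io.StringIO(
    "user_id,age,gender,country,signup_date,subscription_type\n"
    "1,24,Female,France,2022-03-14,Free\n"
    "2,31,Male,Germany,2021-11-02,Premium\n"
)

users = pd.read_csv(
    sample_csv,  # replace with "user_data.csv" for the real file
    dtype={"user_id": "int64", "age": "int64",
           "gender": "category", "country": "category"},
    parse_dates=["signup_date"],
)
# An ordered categorical lets tiers be compared and sorted (assumed order: Free < Basic < Premium).
tiers = pd.CategoricalDtype(["Free", "Basic", "Premium"], ordered=True)
users["subscription_type"] = users["subscription_type"].astype(tiers)

print(users.dtypes)
```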

## Content Table (content_data.csv):

- `content_id` (integer): Unique identifier for each content item.
- `content_type` (string): Type of content (Movie, TV Show, Live Event).
- `genre` (string): Genre of the content (e.g., Action, Comedy, Drama, Documentary).
- `release_year` (integer): Year when the content was released.
- `duration_minutes` (integer): Duration of the content in minutes (populated for every content type, including Live Events).
- `average_user_rating` (float): Average rating given by users for the content, ranging from 1 (lowest) to 5 (highest).

## User-Content Interactions Table (interactions_data.csv):

- `user_id` (integer): Unique identifier for the user who interacted with the content.
- `content_id` (integer): Unique identifier for the content that was interacted with.
- `interaction_type` (string): Type of interaction (Played, Liked, Disliked, Added to Playlist, Shared).
- `interaction_timestamp` (datetime): Timestamp when the interaction occurred.
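
Because every interaction should belong to a known user and occur after that user's signup, a referential check is a useful early step. The sketch below assumes the column names above and uses small inline frames in place of the CSV files.

```python
import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2],
    "signup_date": pd.to_datetime(["2022-01-01", "2023-01-01"]),
})
interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "content_id": [10, 11, 10, 12],
    "interaction_type": ["Played", "Liked", "Played", "Shared"],
    "interaction_timestamp": pd.to_datetime(
        ["2022-05-01 10:00", "2021-12-31 23:00", "2023-06-01 08:30", "2023-07-01 09:00"]
    ),
})

# indicator=True adds a _merge column showing which rows found a matching user.
checked = interactions.merge(users, on="user_id", how="left", indicator=True)

orphaned = checked[checked["_merge"] == "left_only"]
too_early = checked[checked["interaction_timestamp"] < checked["signup_date"]]

print(f"interactions without a matching user: {len(orphaned)}")
print(f"interactions before signup: {len(too_early)}")
```

Rows without a matching user have a missing `signup_date`, so they never count as "before signup" and are reported separately.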

## Revenue Table (revenue_data.csv):

- `user_id` (integer): Unique identifier for the user who generated the revenue.
- `content_id` (integer): Unique identifier for the content that generated the revenue.
- `revenue_amount` (float): Amount of revenue generated (in USD).
- `revenue_type` (string): Type of revenue (Subscription, Advertisement, In-app Purchase).
- `transaction_date` (date): Date when the revenue was generated.
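
A common first view of this table is revenue per month broken down by `revenue_type`. The sketch below assumes the columns above. The inline rows are made up for illustration, and the `month` column is a helper added here.

```python
import pandas as pd

# Tiny stand-in for pd.read_csv("revenue_data.csv", parse_dates=["transaction_date"])
revenue = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 2],
    "content_id": [10, 11, 10, 12, 11],
    "revenue_amount": [12.50, 3.20, 7.80, 41.00, 9.90],
    "revenue_type": ["Subscription", "Advertisement", "Subscription",
                     "In-app Purchase", "Advertisement"],
    "transaction_date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-14", "2023-02-28"]
    ),
})

# Bucket transactions by calendar month, then sum revenue per revenue_type.
revenue["month"] = revenue["transaction_date"].dt.to_period("M")
monthly = revenue.pivot_table(
    index="month", columns="revenue_type",
    values="revenue_amount", aggfunc="sum", fill_value=0,
)
print(monthly)
```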


In [None]:
users_df

In [None]:
revenue_df

In [None]:
interactions_df

In [None]:
content_df

In [None]:
ab_test_interactions_df