# Read data

The data was downloaded from this [Kaggle dataset](https://www.kaggle.com/datasets/sahirmaharajj/crime-data-from-2020-to-present-updated-monthly).

Since the original dataset is too large to upload to GitHub, we will use a sample of 1,000 rows.
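
The sampling step itself is not part of this notebook. As a rough sketch of how such a sample could be drawn with `DataFrame.sample`, assuming a hypothetical full DataFrame `df_full` (replaced here by a small stand-in) and an arbitrary `random_state`:

```python
import pandas as pd

# Stand-in for the full Kaggle export (hypothetical; the real file is much larger).
df_full = pd.DataFrame({
    "DR_NO": range(10_000),
    "AREA": [i % 21 + 1 for i in range(10_000)],
})

# Draw 1,000 rows without replacement; random_state makes the sample reproducible.
df_sample = df_full.sample(n=1000, random_state=42)

# The sample could then be saved for the notebook to read, for example:
# df_sample.to_csv("input/crime_data_sample.csv", index=False)
print(df_sample.shape)
```

Fixing `random_state` means anyone rerunning the sampling gets the same 1,000 rows.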

In [4]:
import pandas as pd

df = pd.read_csv("input/crime_data_sample.csv")

df.head()

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,TIME OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
0,190326475,03/01/2020 12:00:00 AM,03/01/2020 12:00:00 AM,2130,7,Wilshire,784,1,510,VEHICLE - STOLEN,...,AA,Adult Arrest,510.0,998.0,,,1900 S LONGWOOD AV,,34.0375,-118.3506
1,200106753,02/09/2020 12:00:00 AM,02/08/2020 12:00:00 AM,1800,1,Central,182,1,330,BURGLARY FROM VEHICLE,...,IC,Invest Cont,330.0,998.0,,,1000 S FLOWER ST,,34.0444,-118.2628
2,200320258,11/11/2020 12:00:00 AM,11/04/2020 12:00:00 AM,1700,3,Southwest,356,1,480,BIKE - STOLEN,...,IC,Invest Cont,480.0,,,,1400 W 37TH ST,,34.021,-118.3002
3,200907217,05/10/2023 12:00:00 AM,03/10/2020 12:00:00 AM,2037,9,Van Nuys,964,1,343,SHOPLIFTING-GRAND THEFT ($950.01 & OVER),...,IC,Invest Cont,343.0,,,,14000 RIVERSIDE DR,,34.1576,-118.4387
4,220614831,08/18/2022 12:00:00 AM,08/17/2020 12:00:00 AM,1200,6,Hollywood,666,2,354,THEFT OF IDENTITY,...,IC,Invest Cont,354.0,,,,1900 TRANSIENT,,34.0944,-118.3277


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   DR_NO           1000 non-null   int64  
 1   Date Rptd       1000 non-null   object 
 2   DATE OCC        1000 non-null   object 
 3   TIME OCC        1000 non-null   int64  
 4   AREA            1000 non-null   int64  
 5   AREA NAME       1000 non-null   object 
 6   Rpt Dist No     1000 non-null   int64  
 7   Part 1-2        1000 non-null   int64  
 8   Crm Cd          1000 non-null   int64  
 9   Crm Cd Desc     1000 non-null   object 
 10  Mocodes         902 non-null    object 
 11  Vict Age        1000 non-null   int64  
 12  Vict Sex        937 non-null    object 
 13  Vict Descent    937 non-null    object 
 14  Premis Cd       1000 non-null   float64
 15  Premis Desc     1000 non-null   object 
 16  Weapon Used Cd  202 non-null    float64
 17  Weapon Desc     202 non-null    ob

In [6]:
df["AREA NAME"].value_counts()

AREA NAME
77th Street    96
N Hollywood    63
Newton         59
Southeast      54
Devonshire     53
Hollywood      51
West Valley    50
Mission        48
Southwest      47
Pacific        47
Rampart        47
Olympic        46
Wilshire       45
Harbor         40
Central        40
Topanga        40
Foothill       38
Van Nuys       37
Northeast      37
West LA        35
Hollenbeck     27
Name: count, dtype: int64

The dataset has 21 areas, which is a lot of categories for a single categorical column.

We will group these areas into larger, fictional regions.

# Grouping areas into regions

In [7]:
# Created by looking at LA map
regions = {
    "Central": ["Central","Hollywood", "Olympic", "Newton", "Wilshire", "Rampart", "Hollenbeck", "Pacific"],
    "Northeast": ["Northeast", "N Hollywood", "Foothill", "Mission"],
    "South": ["Southwest", "Southeast", "Harbor", "77th Street"],
    "Northwest": ["Van Nuys", "West Valley", "Devonshire", "West LA", "Topanga"]
}

In [8]:
def find_region(area_name):
    for region in regions.keys():
        if area_name in regions[region]:
            return region

In [9]:
df["region"] = df["AREA NAME"].apply(find_region)
df["region"].value_counts()

region
Central      362
South        237
Northwest    215
Northeast    186
Name: count, dtype: int64
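
`find_region` returns `None` for any area that is missing from `regions`, which is easy to overlook. One possible alternative, sketched here with an abbreviated mapping and a made-up `"Unknown Area"` value, is to invert the dictionary once and use `Series.map`, so unmapped areas surface as `NaN`:

```python
import pandas as pd

# Same structure as the notebook's mapping, abbreviated to two regions.
regions = {
    "South": ["Southwest", "Southeast", "Harbor", "77th Street"],
    "Northeast": ["Northeast", "N Hollywood", "Foothill", "Mission"],
}

# Invert {region: [areas]} into {area: region} for a direct lookup.
area_to_region = {
    area: region for region, areas in regions.items() for area in areas
}

# "Unknown Area" is a made-up value used to show what an unmapped area looks like.
areas = pd.Series(["Harbor", "Mission", "Unknown Area"], name="AREA NAME")
mapped = areas.map(area_to_region)

# Areas missing from the mapping become NaN, so they are easy to detect.
unmapped = areas[mapped.isna()].unique().tolist()
print(mapped.tolist(), unmapped)
```

An empty `unmapped` list confirms that every area was assigned to a region.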

The dataset was chosen for the project because it has a good number of rows and columns, plus date columns that span several years.

However, our project requires dashboards for a fictional company, and companies need customers.

For this, we were allowed to create synthetic data.

# Generate customers dataset

In [10]:
# Find the min and max date of occurrence, so that customer creation dates stay in line with the dataset.
print(f"Min occurrence date: {df['DATE OCC'].min()}")
print(f"Max occurrence date: {df['DATE OCC'].max()}")

Min occurrence date: 01/01/2020 12:00:00 AM
Max occurrence date: 12/31/2020 12:00:00 AM
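
Note that `DATE OCC` is stored as text (dtype `object`), so `min()` and `max()` compare strings character by character. With the `MM/DD/YYYY` layout, that only matches chronological order when all dates fall in the same year. A small sketch of parsing first, assuming the format seen in the rows above and using made-up values:

```python
import pandas as pd

# Made-up values in the same text format as the DATE OCC column.
dates = pd.Series([
    "12/31/2020 12:00:00 AM",
    "01/15/2021 12:00:00 AM",
    "03/01/2020 12:00:00 AM",
])

# Parse with an explicit format so pandas does not have to guess.
parsed = pd.to_datetime(dates, format="%m/%d/%Y %I:%M:%S %p")

print(dates.max())   # string comparison picks the December 2020 value
print(parsed.max())  # chronological comparison picks the January 2021 value
```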


In [11]:
from datetime import datetime

START_DATE = datetime(2020, 1, 1)
END_DATE = datetime(2020, 12, 31)

In [12]:
from faker import Faker
from collections import OrderedDict

# Initialize Faker
fake = Faker()
Faker.seed(1234)

We could generate customers and their data completely at random, but then any trend found later for the report would be pure coincidence.

Instead, let's insert some synthetic trends into the data.

In [13]:
def generate_customer_dataset(num_rows):
    dataset = []
    for _ in range(num_rows):
        customer_id = fake.unique.random_int(min=1, max=999999)

        # Using the dates we found earlier
        registration_date = fake.date_between_dates(date_start=START_DATE, date_end=END_DATE)

        # We want to create three clusters of customers that later on can be separated by analysing the region.
        # We are going to assign customers to regions according to proportions
        region = fake.random_element(elements=OrderedDict([("North", 0.3), ("Central", 0.5), ("South", 0.2)]))  
        dataset.append((customer_id, registration_date, region))
    
    return pd.DataFrame(dataset, columns=['customer_id', 'registration_date', 'region'])
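
When `random_element` receives an `OrderedDict`, Faker uses the values as weights. To illustrate what weighted assignment does, here is a sketch that reproduces the same proportions with the standard library's `random.choices` instead of Faker. The seed and the number of draws are arbitrary choices for this illustration:

```python
import random
from collections import Counter

rng = random.Random(1234)  # arbitrary seed, only to make this sketch reproducible

names = ["North", "Central", "South"]
weights = [0.3, 0.5, 0.2]  # same proportions as in generate_customer_dataset

# Draw 10,000 regions with replacement, respecting the weights.
draws = rng.choices(names, weights=weights, k=10_000)

# Observed share of each region; these should land close to the weights.
shares = {name: count / len(draws) for name, count in Counter(draws).items()}
print(shares)
```

The observed shares only approximate the weights, so a small dataset can deviate noticeably from 30/50/20.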


In [14]:
# Creating a dataset with 5,104 rows
df_customers = generate_customer_dataset(5104)
df_customers.head()

Unnamed: 0,customer_id,registration_date,region
0,815966,2020-06-09,North
1,955228,2020-12-10,South
2,36623,2020-09-02,North
3,803714,2020-05-09,North
4,827089,2020-01-06,Central


In [15]:
# region is not a column that exists in the original dataset.
# To join customers to the original dataset, we will use area_name, and for that we need to add that column to our customers table
def add_area_to_customer(region):
    if region == "North":
        return fake.random_element(elements=regions["Northeast"]+regions["Northwest"])
    else:
        return fake.random_element(elements=regions[region])

In [16]:
df_customers["area_name"] = df_customers["region"].apply(add_area_to_customer)
df_customers.head()

Unnamed: 0,customer_id,registration_date,region,area_name
0,815966,2020-06-09,North,Topanga
1,955228,2020-12-10,South,Southwest
2,36623,2020-09-02,North,Northeast
3,803714,2020-05-09,North,Van Nuys
4,827089,2020-01-06,Central,Hollenbeck


In [17]:
# Drop the region column: it encodes the hidden trend we are inserting, so it should not appear in the exported data
df_customers_out = df_customers.drop('region', axis=1)

df_customers_out.to_csv('output/customers.csv', index=False)
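
As a sketch of how the exported customers could later be joined to the crime data on the area name, the example below uses tiny made-up stand-ins for both tables. The `crime_count` column name is an assumption introduced for illustration:

```python
import pandas as pd

# Tiny stand-ins for the crime sample and the customers table (made-up rows).
crimes = pd.DataFrame({
    "AREA NAME": ["Topanga", "Topanga", "Harbor"],
    "DR_NO": [1, 2, 3],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "area_name": ["Topanga", "Harbor", "Harbor"],
})

# Aggregate crimes per area first so the merge keeps one row per customer.
crimes_per_area = (
    crimes.groupby("AREA NAME").size().rename("crime_count").reset_index()
)

joined = customers.merge(
    crimes_per_area, left_on="area_name", right_on="AREA NAME", how="left"
)
print(joined[["customer_id", "area_name", "crime_count"]])
```

Merging the raw tables directly would produce one row per customer and crime pair in each area, which grows quickly; aggregating first avoids that.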

# Generate payments and ratings datasets

For the customer events, we will create payments and ratings.

In [22]:
import numpy as np

np.random.seed(1)


def generate_rating(mean: float) -> float:
    """Generates one rating pulling from a normal distribution with a given mean and a std of 1.
    Ratings will have a floor of 1 and a ceiling of 5.

    Args:
        mean (float): Mean for the normal distribution from which the rating will be pulled.

    Returns:
        float: The rating.
    """
    # Draw a scalar (no size argument) to avoid NumPy's array-to-scalar deprecation warning
    rating = round(float(np.random.normal(loc=mean, scale=1)), 1)
    if rating > 5:
        return 5.0
    elif rating < 1:
        return 1.0
    else:
        return rating
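
If many ratings are needed at once, the same draw, round and clamp steps can be applied to a whole array. This is a sketch of one possible vectorized variant, not the function used in this notebook. `generate_ratings` and the seed are hypothetical names and values, and it uses NumPy's `default_rng` rather than the global `np.random.seed`:

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for this sketch


def generate_ratings(mean: float, n: int) -> np.ndarray:
    """Draw n ratings from N(mean, 1), round to one decimal, clip to [1, 5]."""
    raw = rng.normal(loc=mean, scale=1.0, size=n)
    return np.clip(np.round(raw, 1), 1.0, 5.0)


ratings = generate_ratings(mean=2.0, n=1000)
print(ratings.min(), ratings.max())
```

Clipping concentrates probability mass at the bounds, so a cluster with a mean of 2 will contain many ratings of exactly 1.0.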

In [19]:
def generate_payments_ratings_dataset(num_rows, region, mean_payment, mean_rating):
    dataset = []
    for _ in range(num_rows):

        # Get customer_id from existing customer df from the region passed as argument
        customer_id = fake.random_element(df_customers[df_customers['region']==region]['customer_id'])

        # A customer's activity should only start after their registration date
        event_date = fake.date_between_dates(
            date_start=df_customers[df_customers['customer_id']==customer_id]['registration_date'].values[0], 
            date_end=END_DATE
        )

        # Draw a scalar (no size argument) to avoid NumPy's array-to-scalar deprecation warning
        payment_value = int(np.random.normal(loc=mean_payment, scale=50))
        rating = generate_rating(mean_rating)
        dataset.append((customer_id, event_date, payment_value, rating))
    
    return pd.DataFrame(dataset, columns=['customer_id', 'event_date', 'payment_value', 'rating'])

The `generate_payments_ratings_dataset` function lets us choose the customers' region, as well as the mean payment value and mean rating.

We will use it to create three clusters: customers from the North region with low payment values and ratings, customers from the Central region with medium values, and customers from the South region with high values.
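
The row-by-row loop filters `df_customers` on every iteration, which becomes slow for larger datasets. A possible vectorized alternative for the date logic is sketched below, assuming made-up customers and an arbitrary seed. It samples customers with replacement and draws a uniform day offset between each registration date and `END_DATE`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
END_DATE = pd.Timestamp("2020-12-31")

# Made-up customers standing in for one region of df_customers.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "registration_date": pd.to_datetime(["2020-01-10", "2020-06-01", "2020-12-30"]),
})

# Sample customers with replacement: one row per event.
events = customers.sample(n=500, replace=True, random_state=0).reset_index(drop=True)

# Days left between registration and END_DATE, then a uniform offset in [0, days_left].
days_left = (END_DATE - events["registration_date"]).dt.days.to_numpy()
offsets = rng.integers(0, days_left + 1)  # upper bound is exclusive, hence + 1
events["event_date"] = events["registration_date"] + pd.to_timedelta(offsets, unit="D")

print(events.head())
```

Every event date falls on or after the customer's registration date and no later than `END_DATE`, which is the same constraint the loop enforces.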

In [23]:
df_payments_ratings_north = generate_payments_ratings_dataset(2003, "North", 300, 2)  # Low rating cluster
df_payments_ratings_central = generate_payments_ratings_dataset(3142, "Central", 500, 3)  # Medium rating cluster
df_payments_ratings_south = generate_payments_ratings_dataset(3536, "South", 700, 4)  # High rating cluster
df_payments_ratings = pd.concat([df_payments_ratings_north, df_payments_ratings_central, df_payments_ratings_south])

df_payments_ratings.head()


Unnamed: 0,customer_id,event_date,payment_value,rating
0,693015,2020-11-29,381,1.4
1,892049,2020-08-23,273,1.0
2,492943,2020-09-02,343,1.0
3,286349,2020-12-01,387,1.2
4,875061,2020-11-27,315,1.8
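
To confirm that the inserted trend is recoverable, the events can be joined back to the in-memory customers table (which still has `region`) and averaged per region. A minimal sketch with made-up rows standing in for `df_payments_ratings` and `df_customers`:

```python
import pandas as pd

# Made-up rows standing in for df_payments_ratings and df_customers.
payments = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "payment_value": [310, 290, 505, 690, 720],
    "rating": [2.1, 1.8, 3.0, 4.2, 3.9],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["North", "Central", "South"],
})

# Attach the region to every event, then average payments and ratings per region.
summary = (
    payments.merge(customers, on="customer_id")
    .groupby("region")[["payment_value", "rating"]]
    .mean()
    .round(2)
)
print(summary)
```

On the real data, the per-region means should sit near the values passed to `generate_payments_ratings_dataset`, with ratings pulled slightly toward the middle by the floor of 1 and ceiling of 5.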


In [21]:
df_payments_ratings.to_csv('output/payments_ratings.csv', index=False)