#Data is the foundation of data science and machine learning. Analysis, visualization, and model building often require thousands of data points.

Dummy data is useful when you already know which features and data types you will be working with but don't yet have the data itself.

#Faker is a Python package that generates fake data for you.
Faker can generate many kinds of fake data, for instance names, addresses, emails, and free text.

In [None]:
!pip install faker

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faker
  Downloading Faker-14.1.0-py3-none-any.whl (1.6 MB)
Installing collected packages: faker
Successfully installed faker-14.1.0


#Import Libraries

In [None]:
# Libraries
import pandas as pd
#to generate unique ids
import uuid
#to generate random numbers
import random
from faker import Faker
import datetime

# Number of rows or users to create

In [None]:

num_users = 100

#Add feature list


*   ID: a unique string of characters that identifies each user.
*   Gender: a string with one of three choices.
*   Subscriber: a True/False value for the user's subscription status.
*   Name: a string holding the user's first and last name.
*   Rating: an integer rating from 1 through 5.

In [None]:
#  A list of 5 features
features = [
    "id",
    "gender",
    "subscriber",
    "name",
    "rating"
]

# Creating a DF for these features
df = pd.DataFrame(columns=features)

df

Unnamed: 0,id,gender,subscriber,name,rating


#Creating Imbalanced Data
Some of the attributes above should contain imbalanced data. For several of them it is reasonable to assume that the choices will not be equally represented, so a more realistic dataset needs to reflect those tendencies.

#Generate unique identifiers
Python's built-in uuid module is a good fit for generating a unique ID for each user, because the chance of uuid4() producing a duplicate is astronomically low.

In [None]:
df['id'] = [uuid.uuid4().hex for _ in range(num_users)]
df

Unnamed: 0,id,gender,subscriber,name,rating
0,0429445d97d041228302a814b956e272,,,,
1,b826af25a2d04af99a3bd95ce71a9905,,,,
2,34b9fe9331f846118ccd0090a636b053,,,,
3,790f65cb9dd24561a16ad87c41b78b0d,,,,
4,536656d5c36c41b5804e0f6cb704b84d,,,,
...,...,...,...,...,...
95,1e727194af6a4a09b9fc781f1b589a85,,,,
96,be9998e53b1e41ac9b77ab00ea3f5745,,,,
97,c7b2f5d075c0431d94490917892482fd,,,,
98,621ad2b4ae1b4328b57acc978213dea0,,,,


# Checking if all IDs are unique

In [None]:
print(df['id'].nunique()==num_users)

True
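
To see what these identifiers look like on their own, here is a small sketch using only the standard library. The names `ids` and `sample` are illustrative; the printed IDs differ on every run.

```python
import uuid

# Generate a handful of IDs the same way as above
ids = [uuid.uuid4().hex for _ in range(5)]

for value in ids:
    print(value, len(value))

# uuid4() builds the ID from random bits; .hex drops the dashes,
# leaving 32 hexadecimal characters
sample = uuid.uuid4()
print(sample.version)   # 4
print(str(sample))      # dashed form, 36 characters long

# Same uniqueness check as above, done with a set
print(len(set(ids)) == len(ids))
```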


#Generating Gender values
Gender is one attribute where a uniform random choice should probably not be used, because it can reasonably be assumed that the choices are not equally likely to occur.

For gender there are three options: male, female, and na. If we drew them with equal probability, each option would appear roughly equally often in the dataset. In reality, there would be significantly more male and female entries than na entries, so the draw below is weighted 45/45/10.

In [None]:
genders = ["male", "female", "na"]


df['gender'] = random.choices(
    genders, 
    weights=(45,45,10), 
    k=num_users
)

df

Unnamed: 0,id,gender,subscriber,name,rating
0,0429445d97d041228302a814b956e272,na,,,
1,b826af25a2d04af99a3bd95ce71a9905,male,,,
2,34b9fe9331f846118ccd0090a636b053,female,,,
3,790f65cb9dd24561a16ad87c41b78b0d,male,,,
4,536656d5c36c41b5804e0f6cb704b84d,male,,,
...,...,...,...,...,...
95,1e727194af6a4a09b9fc781f1b589a85,male,,,
96,be9998e53b1e41ac9b77ab00ea3f5745,female,,,
97,c7b2f5d075c0431d94490917892482fd,male,,,
98,621ad2b4ae1b4328b57acc978213dea0,female,,,
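
The effect of the weights is easier to see on a larger sample. The sketch below compares an unweighted draw with the 45/45/10 draw; the sample size `k` and the seeded generator `rng` are assumptions added only to make the comparison readable and repeatable.

```python
import random
from collections import Counter

genders = ["male", "female", "na"]
k = 10_000  # larger than num_users so the proportions are easier to see

# Hypothetical seed, used only so the sketch gives the same result each run
rng = random.Random(0)

# Without weights, every option is equally likely
uniform = Counter(rng.choices(genders, k=k))

# With weights, "na" is drawn far less often than the other two
weighted = Counter(rng.choices(genders, weights=(45, 45, 10), k=k))

for label in genders:
    print(f"{label:>6}  uniform={uniform[label] / k:.2f}  weighted={weighted[label] / k:.2f}")
```

The weights are relative, so `(45, 45, 10)` and `(9, 9, 2)` describe the same distribution.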


#Generating Subscriber values
For this attribute, the value can be chosen at random between True and False with equal probability, on the assumption that about half of the users are subscribers.

In [None]:
# Choices
choice = [True, False]

df['subscriber'] = random.choices(
    choice, 
    k=num_users
)

df

Unnamed: 0,id,gender,subscriber,name,rating
0,0429445d97d041228302a814b956e272,na,True,,
1,b826af25a2d04af99a3bd95ce71a9905,male,True,,
2,34b9fe9331f846118ccd0090a636b053,female,True,,
3,790f65cb9dd24561a16ad87c41b78b0d,male,True,,
4,536656d5c36c41b5804e0f6cb704b84d,male,False,,
...,...,...,...,...,...
95,1e727194af6a4a09b9fc781f1b589a85,male,True,,
96,be9998e53b1e41ac9b77ab00ea3f5745,female,False,,
97,c7b2f5d075c0431d94490917892482fd,male,False,,
98,621ad2b4ae1b4328b57acc978213dea0,female,True,,
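
The values drawn this way change on every run. If a repeatable dataset is wanted, one option is a dedicated, seeded generator. This is a sketch only; the seed value and the helper `draw_subscribers` are hypothetical and not part of the notebook above.

```python
import random

# Hypothetical seed; any fixed integer gives a repeatable sequence
SEED = 42

def draw_subscribers(n, seed=SEED):
    """Return n True/False values from a dedicated, seeded generator."""
    rng = random.Random(seed)
    return rng.choices([True, False], k=n)

first = draw_subscribers(100)
second = draw_subscribers(100)

print(first == second)          # True: same seed, same sequence
print(sum(first) / len(first))  # observed share of subscribers
```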


#Generating Name values
Here we can use the Faker library to create a name for each user. Faker suits this situation because it offers separate generators for male and female names. The French locale used below produces French-style names.

In [None]:
# Instantiating faker
faker = Faker(locale="fr_FR")

def name_gen(gender):
    """
    Quickly generates a name based on gender
    """
    if gender=='male':
        return faker.name_male()
    elif gender=='female':
        return faker.name_female()
    
    return faker.name()

# Generating names for each user
df['name'] = [name_gen(i) for i in df['gender']]

df

Unnamed: 0,id,gender,subscriber,name,rating
0,0429445d97d041228302a814b956e272,na,True,Robert Lucas,
1,b826af25a2d04af99a3bd95ce71a9905,male,True,Gilbert Reynaud Le Ollivier,
2,34b9fe9331f846118ccd0090a636b053,female,True,Chantal Hamon,
3,790f65cb9dd24561a16ad87c41b78b0d,male,True,Pierre Germain,
4,536656d5c36c41b5804e0f6cb704b84d,male,False,Henri Gautier du Labbé,
...,...,...,...,...,...
95,1e727194af6a4a09b9fc781f1b589a85,male,True,Laurent Humbert,
96,be9998e53b1e41ac9b77ab00ea3f5745,female,False,Édith Jean,
97,c7b2f5d075c0431d94490917892482fd,male,False,Yves Mahe-Marion,
98,621ad2b4ae1b4328b57acc978213dea0,female,True,Pauline Hoarau-Moreno,


#Generate Ratings value
The 1-to-5 rating does not represent anything in particular; it is a placeholder for whatever the dataset needs.

For the ratings themselves, we can choose to skew the distribution towards the extremes, reflecting the tendency of users to be more absolute with their ratings.

In [None]:
# The different ratings available
ratings = [1,2,3,4,5]

# Weighted ratings with a skew towards the ends (1 and 5 are twice as likely as 2, 3, or 4)
df['rating'] = random.choices(
    ratings,
    weights=(10,5,5,5,10),
    k=num_users
)

df

Unnamed: 0,id,gender,subscriber,name,rating
0,0429445d97d041228302a814b956e272,na,True,Robert Lucas,4
1,b826af25a2d04af99a3bd95ce71a9905,male,True,Gilbert Reynaud Le Ollivier,2
2,34b9fe9331f846118ccd0090a636b053,female,True,Chantal Hamon,3
3,790f65cb9dd24561a16ad87c41b78b0d,male,True,Pierre Germain,3
4,536656d5c36c41b5804e0f6cb704b84d,male,False,Henri Gautier du Labbé,4
...,...,...,...,...,...
95,1e727194af6a4a09b9fc781f1b589a85,male,True,Laurent Humbert,2
96,be9998e53b1e41ac9b77ab00ea3f5745,female,False,Édith Jean,1
97,c7b2f5d075c0431d94490917892482fd,male,False,Yves Mahe-Marion,4
98,621ad2b4ae1b4328b57acc978213dea0,female,True,Pauline Hoarau-Moreno,1
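
One way to check which direction a set of weights skews is to normalize it into probabilities before sampling. The tuple in this sketch is a hypothetical example that favors the extremes; it is not taken from the notebook above.

```python
ratings = [1, 2, 3, 4, 5]

# Hypothetical weights that put more mass on 1 and 5
weights = (30, 15, 10, 15, 30)

# random.choices treats weights as relative, so dividing by the total
# gives the probability of each rating
total = sum(weights)
probabilities = {r: w / total for r, w in zip(ratings, weights)}

for rating, p in probabilities.items():
    print(rating, f"{p:.2f}")

# The skew is towards the ends when both outer weights exceed the middle one
skews_to_ends = weights[0] > weights[2] and weights[-1] > weights[2]
print(skews_to_ends)
```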


#Save the DataFrame to a CSV file

In [None]:
df.to_csv('dataset.csv')

In [None]:
# Viewing the saved csv file
csv_df = pd.read_csv('dataset.csv', index_col=0)

csv_df
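
By default `to_csv` also writes the row index, which is why `index_col=0` is needed when reading the file back. As an alternative, the index can be left out when saving. The sketch below uses a small stand-in frame with made-up values and an in-memory buffer instead of a file; the names `demo`, `buffer`, and `restored` are illustrative.

```python
import io
import pandas as pd

# Small stand-in frame with the same columns as above (values are made up)
demo = pd.DataFrame({
    "id": ["a1", "b2"],
    "gender": ["male", "na"],
    "subscriber": [True, False],
    "name": ["Example One", "Example Two"],
    "rating": [5, 1],
})

# Write without the index column, then rewind the buffer for reading
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# No index_col argument is needed because no index was written
restored = pd.read_csv(buffer)

print(list(restored.columns))
print(restored)
```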