# FAKE DATA GENERATION

Write a Python notebook that generates a file containing the following data:

- Email addresses. Must have an "@".
- Phone numbers.
- Home address.
- Person's name.
- Year born. Use realistic values.
- Number of kids. Use realistic values.
- Categorical variable: rent or own?
- Annual income. Optional challenge: use a non-uniform distribution.
- Number of speeding tickets in past year. Optional challenge: use a non-uniform distribution.

The user of your notebook should be able to specify how many entities are to be generated.
Do not include the .csv output file in your submission -- the file should be generated dynamically.
Order of columns in CSV is not relevant.
See slides in lecture for tips.

In [112]:
# This cell handles all required imports. Note: run the cells in order, because later cells depend on variables defined in earlier ones.

# The names and Faker packages are not installed by default, so they need to be installed with pip. Using sys.executable installs them for the interpreter that runs this notebook.
import sys
!{sys.executable} -m pip install names
!{sys.executable} -m pip install Faker

# Import the remaining packages: names, random (aliased as r) and pandas (aliased as pd).
import names
import random as r
import pandas as pd

# Number of records to generate. The user can change this value.
no_of_entities = 20




In [113]:
# Generate as many names as the user requested, using the names library.
names_list = []

for x in range(no_of_entities):
    names_list.append(names.get_full_name())
    
print(names_list)

['Linda Whitehead', 'Diana Dietrick', 'John White', 'Gustavo Jones', 'Ronald Wright', 'Beatrice Cheng', 'Joy Pryor', 'Dawn Wallace', 'Jill Ascencio', 'William Lapinsky', 'Mary Sofia', 'Cecil Bias', 'David Craft', 'Charles Alvarado', 'Nancy Campbell', 'Paul Hunter', 'Julie Banks', 'Wesley Snipes', 'Gerald Smith', 'Beth Rains']


In [114]:
# Generate an email address for each name from the cell above. Each full name is split into
# first and last name, which are lowercased, joined with a dot, and paired with a randomly chosen domain.

email_id_list = []
for name in names_list:
    names_splitted = name.split(" ")
    email = names_splitted[0].lower() + "." + names_splitted[1].lower() + "@" + r.choice(["gmail.com", "yahoo.com", "rediffmail.com"])
    email_id_list.append(email)

print(email_id_list)

['linda.whitehead@rediffmail.com', 'diana.dietrick@gmail.com', 'john.white@rediffmail.com', 'gustavo.jones@gmail.com', 'ronald.wright@rediffmail.com', 'beatrice.cheng@rediffmail.com', 'joy.pryor@yahoo.com', 'dawn.wallace@yahoo.com', 'jill.ascencio@rediffmail.com', 'william.lapinsky@gmail.com', 'mary.sofia@gmail.com', 'cecil.bias@rediffmail.com', 'david.craft@yahoo.com', 'charles.alvarado@yahoo.com', 'nancy.campbell@gmail.com', 'paul.hunter@yahoo.com', 'julie.banks@rediffmail.com', 'wesley.snipes@yahoo.com', 'gerald.smith@yahoo.com', 'beth.rains@yahoo.com']
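
Because each address is derived directly from the name, two people who share a name would receive the same email address whenever the same domain is drawn. A minimal sketch of one way to avoid that, assuming a numeric suffix is acceptable. The helper name `make_unique_emails` and the fixed `example.com` domain are hypothetical and not part of the notebook cells:

```python
from collections import Counter

def make_unique_emails(full_names, domain="example.com"):
    """Build first.last@domain addresses, adding a counter to repeated names."""
    seen = Counter()
    emails = []
    for full_name in full_names:
        parts = full_name.lower().split()
        local = parts[0] + "." + parts[-1]   # first and last token of the name
        seen[local] += 1
        # Only the second and later occurrences get a numeric suffix.
        suffix = "" if seen[local] == 1 else str(seen[local])
        emails.append(local + suffix + "@" + domain)
    return emails

sample = make_unique_emails(["John White", "Mary Sofia", "John White"])
print(sample)
# ['john.white@example.com', 'mary.sofia@example.com', 'john.white2@example.com']
```

With only 20 records a collision is unlikely, but it becomes more probable as `no_of_entities` grows.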


In [115]:
# Generate phone numbers that follow the US pattern +1(AAA)PPP-SSSS.
# The area code (AAA) is a 3-digit code that identifies the region the user belongs to.
# The prefix (PPP, 3 digits) identifies the switch, and the subscriber code (SSSS, 4 digits) identifies the line.
# Together they make up the phone number. The country code (+1) is added at the beginning.

phone_numbers =[]

def generate_number():
    area_code = str(r.randint(100, 999))
    prefix_code = str(r.randint(1, 999)).zfill(3)
    subscriber_code = str(r.randint(1,9999)).zfill(4)
    
    return '+1'+ '(' + area_code+ ')'+ prefix_code+'-'+subscriber_code

for x in range(no_of_entities):
    phone_numbers.append(generate_number())

print(phone_numbers)

['+1(247)698-4032', '+1(944)880-4661', '+1(535)885-6205', '+1(495)484-2647', '+1(354)924-0206', '+1(858)220-4051', '+1(721)938-1458', '+1(899)970-5662', '+1(583)599-8022', '+1(684)248-0348', '+1(327)061-2752', '+1(249)720-8556', '+1(199)247-6055', '+1(869)408-6717', '+1(452)120-2900', '+1(869)671-5416', '+1(924)276-6296', '+1(210)478-8589', '+1(604)261-4326', '+1(565)630-1128']
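
The generator above can produce area codes and prefixes that begin with 0 or 1 (for example `+1(199)...` and the prefix `061` in the output), which the North American Numbering Plan does not assign. A hedged sketch of a stricter variant, assuming only the rule that both three-digit groups start with a digit from 2 to 9. The function name `generate_nanp_number` is hypothetical, and further restrictions such as N11 service codes are ignored here:

```python
import random

def generate_nanp_number(rng=random):
    """Return a number formatted like +1(NXX)NXX-XXXX, with N in 2-9."""
    area_code = rng.randint(200, 999)        # first digit is never 0 or 1
    prefix_code = rng.randint(200, 999)      # same rule for the exchange prefix
    subscriber_code = rng.randint(0, 9999)   # any four digits, zero-padded below
    return "+1({}){}-{:04d}".format(area_code, prefix_code, subscriber_code)

numbers = [generate_nanp_number() for _ in range(5)]
print(numbers)
```

The output keeps the same `+1(AAA)PPP-SSSS` layout as the original cell, so it could be used as a drop-in replacement for `generate_number`.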


In [116]:
# Generate home addresses.
# The Faker library returns a random multi-line address. Newlines are replaced with ", " so that each address fits on one readable line.

from faker import Faker
fake = Faker()
address_list= []
for i in range(no_of_entities):
    address = fake.address()
    address = address.replace("\n", ", ")
    address_list.append(address)
    
print(address_list)


['65873 Chen Knolls, Ramirezfurt, WI 94134', 'Unit 1741 Box 7253, DPO AP 25003', 'USNV Porter, FPO AE 89774', '8799 Emma Parkway Suite 735, North Thomasfurt, IN 57039', 'Unit 9061 Box 4352, DPO AE 24201', '30068 David View Apt. 173, New Peggychester, ND 23718', 'PSC 3667, Box 0636, APO AE 81210', 'USS Aguilar, FPO AP 48665', '298 Johnathan Cove Apt. 402, South Jamie, MD 26932', '171 Harrison Motorway, Davidview, CO 74554', '3576 Sergio Avenue, Benjaminmouth, NE 32097', '37457 Tanya Pike Apt. 348, North Ericton, RI 21519', '3673 Peter Turnpike Suite 835, New Sandra, PA 76875', '939 Johnson Oval Suite 830, North Dennismouth, TX 80451', '645 Jennings Estates, Angelastad, NV 51726', '1231 Stephanie Lock Suite 835, North Richardland, MT 77240', '302 Parker Plains Apt. 197, East Robertstad, CO 98152', '098 Hernandez Green, New Sergiobury, MS 98277', '94102 Sims Port Suite 187, Florestown, NE 80082', '01630 Baker Crescent, Kellyborough, ND 71531']


In [117]:
# Generate year born, number of kids, car ownership (rent or own), number of tickets and annual income.
# All values come from the random package and are generated in a single for loop.

year_born_list = []
no_of_kids_list = []
car_ownership_list = []
no_of_tickets_list = []
annual_income_list = []

for i in range(no_of_entities):
    year_born_list.append(r.randint(1950,2002))
    no_of_kids_list.append(r.randint(0,3))
    car_ownership_list.append(r.choice(['Rent', 'Own']))
    no_of_tickets_list.append(r.randint(0,5))
    annual_income_list.append('$ ' + str(r.randint(10000, 200000)))
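
`r.randint` draws every value in its range with equal probability, so the optional non-uniform challenge is not addressed by this cell. One possible approach, sketched under assumptions: income drawn from a log-normal distribution and ticket counts drawn with decreasing weights. The helper names and the parameters (`median`, `sigma`, the income bounds and the weights) are illustrative choices, not values from the assignment or from real statistics:

```python
import math
import random

def skewed_income(rng=random, median=55000, sigma=0.6, low=10000, high=200000):
    """Draw a right-skewed income and clamp it to the range [low, high]."""
    # For a log-normal distribution, the median equals exp(mu).
    value = rng.lognormvariate(math.log(median), sigma)
    return int(min(max(value, low), high))

def skewed_tickets(rng=random):
    """Draw a ticket count where 0 is most likely and 5 is rare."""
    counts = [0, 1, 2, 3, 4, 5]
    weights = [60, 20, 10, 5, 3, 2]   # assumed relative frequencies
    return rng.choices(counts, weights=weights, k=1)[0]

incomes = [skewed_income() for _ in range(1000)]
tickets = [skewed_tickets() for _ in range(1000)]
print(min(incomes), max(incomes), sum(tickets) / len(tickets))
```

Clamping keeps the income within the same 10000 to 200000 range as the original cell, at the cost of a small pile-up of values at the two bounds.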
    

In [118]:
# Finally, create a dictionary that maps each column name to its list of generated values.

data = {'Name': names_list, 'Email_ID': email_id_list, 'Phone No.': phone_numbers, 'Address': address_list,
        'Year Born': year_born_list, 'No. Of Kids': no_of_kids_list, 'Car Ownership': car_ownership_list,
        'No. Of Tickets': no_of_tickets_list, 'Annual Income': annual_income_list}

# Use pandas to create a data frame from the dictionary above.
df = pd.DataFrame(data)

# Write the data frame to a CSV file. The file is saved in the notebook's working directory, normally the same folder as this Jupyter notebook.
df.to_csv('Not_So_Fake_Data.csv', header=True, index=False)
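
Before submitting, it may help to check the generated rows against the assignment's requirements. A small sketch using only the standard library. The column names match the ones written to the CSV, but the sample rows, the in-memory file, the birth-year bounds and the helper `find_problems` are hypothetical stand-ins for reading `Not_So_Fake_Data.csv`:

```python
import csv
import io

def find_problems(rows):
    """Return (row number, message) pairs for rows that break a requirement."""
    problems = []
    for number, row in enumerate(rows, start=1):
        if "@" not in row["Email_ID"]:
            problems.append((number, "email has no @"))
        if not 1900 <= int(row["Year Born"]) <= 2025:   # assumed plausible range
            problems.append((number, "implausible birth year"))
        if row["Car Ownership"] not in ("Rent", "Own"):
            problems.append((number, "unknown category"))
    return problems

# In-memory stand-in for the CSV file written by the notebook.
sample_csv = io.StringIO(
    "Email_ID,Year Born,Car Ownership\n"
    "linda.whitehead@gmail.com,1984,Own\n"
    "broken-address,1975,Lease\n"
)
problems = find_problems(csv.DictReader(sample_csv))
print(problems)
# [(2, 'email has no @'), (2, 'unknown category')]
```

To run the same checks on the real file, the `io.StringIO` object could be replaced with `open('Not_So_Fake_Data.csv', newline='')`.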