#### ---------------------------------------------------------------------------------------------------------------
# Generating Train and Test Data 
### How likely is a person to register in SSY Scheme
#### ---------------------------------------------------------------------------------------------------------------
#### Description:
This program generates simulated train and test data on how likely a person is to register in the SSY Scheme, given their occupation, age group, location etc. (these parameters can be extended or changed later as required). It works on existing data and gives the probabilities of registering in the scheme.

The machine learning code to be written next is expected to learn the inherent pattern in this probability of registration as a function of attributes such as occupation, age group and location.

Further analytics and choropleth BI reports can be generated from this dataset.

# Initialization Block
### Assign global constants, read locations & create empty rows for each person

In [38]:
# import libraries
import numpy as np
import pandas as pd
import csv
import random
import string
from random import choice, randint

# Global values and constants
# min_ppl_in_Subdiv & max_ppl_in_Subdiv can be altered to get more data.
# Population in a subdivision is a random number between the two. Increasing these parameters may increase the processing time
min_char_in_name = 4
max_char_in_name = 6
min_ppl_in_Subdiv = 3
max_ppl_in_Subdiv = 4
# Probability and randomness factors should be soft-coded here as a best practice.
# This has been traded off here for the sake of quick results.
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$ These are the factors the machine learning code has to learn and predict on this basis $$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

# open and read input csv
csv_path = "SubdivisionList.csv"
df_subdivisions = pd.read_csv(csv_path)

# define intermediate dataframes
df_location_pref = pd.DataFrame(columns = ['DistCode', 'District', 'Subdivision', 'LocationLikelyness'])
df_persons = pd.DataFrame(columns = ['DistCode', 'District', 'Subdivision', 'PersonName', 'Gender', 'Age', 'WorkerType',
                                        'WorkerTypeLikelyness', 'AgeLikelyness', 'GenderLikelyness', 'LocationLikelyness',
                                        'Likelyness', 'RegisteredYesNo', 'Registered2017', 'Registered2016'])

# Populate intermediate data
### Inject parameter-based likelihood to register

In [39]:
# Location-based likelihood to register
for index, row in df_subdivisions.iterrows():
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    df_location_pref = pd.concat([df_location_pref, df_subdivisions.iloc[[index]]], ignore_index=True)
    
    if index%4 == 0:
        # for 25% of the data, location-based likelihood to register will be high (0.8 to 1.0)
        df_location_pref.LocationLikelyness.iloc[[index]] = 0.8 + 0.2 * np.random.uniform(0, 1)
    elif index%4 == 1:
        # for 25% of the data, location-based likelihood to register will be medium (0.6 to 1.0)
        df_location_pref.LocationLikelyness.iloc[[index]] = 0.6 + 0.4 * np.random.uniform(0, 1)
    elif index%4 == 2:
        # for 25% of the data, location-based likelihood to register will be low (0.4 to 1.0)
        df_location_pref.LocationLikelyness.iloc[[index]] = 0.4 + 0.6 * np.random.uniform(0, 1)
    else:
        # for 25% of the data, location-based likelihood to register will be very low (0.2 to 1.0)
        df_location_pref.LocationLikelyness.iloc[[index]] = 0.2 + 0.8 * np.random.uniform(0, 1)
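
The four tiers above all follow one pattern: a fixed base plus a uniform random share of the remaining headroom up to 1.0. As a rough sketch only, the same rule could be written as a single vectorised function. The name `tiered_likelihood` and its parameters are hypothetical and are not part of this notebook.

```python
import numpy as np

def tiered_likelihood(n_rows, bases=(0.8, 0.6, 0.4, 0.2), seed=None):
    """Hypothetical helper: base + (1 - base) * U(0, 1), cycling through the tiers by row index."""
    rng = np.random.default_rng(seed)
    # Row i gets bases[i % 4], mirroring the index%4 branches in the loop above
    base = np.asarray(bases)[np.arange(n_rows) % len(bases)]
    return base + (1.0 - base) * rng.uniform(0.0, 1.0, size=n_rows)

# Eight illustrative rows: two full cycles through the four tiers
likelihood = tiered_likelihood(8, seed=0)
print(likelihood.round(3))
```

Every tier can reach values close to 1.0; only the lower bound differs, so the tiers overlap rather than partition the range.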

### Create row for each individual

In [40]:
# Create row for each individual under all locations
for index, row in df_location_pref.iterrows():
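    # np.random.randint excludes the upper bound, so add 1 to make max_ppl_in_Subdiv reachable
    for i in range(np.random.randint(min_ppl_in_Subdiv, max_ppl_in_Subdiv + 1)):
        df_persons = pd.concat([df_persons, df_location_pref.iloc[[index]]], ignore_index=True)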
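
Growing a DataFrame one row at a time gets slow as the population bounds are raised. One possible alternative, sketched here on a toy frame, is to draw a head count per subdivision and repeat each location row that many times in one step. The frame `locations` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for df_location_pref (illustrative values only)
locations = pd.DataFrame({
    "Subdivision": ["A", "B", "C"],
    "LocationLikelyness": [0.9, 0.7, 0.5],
})
min_ppl, max_ppl = 3, 4

# endpoint=True makes the upper bound inclusive, unlike np.random.randint
counts = rng.integers(min_ppl, max_ppl, size=len(locations), endpoint=True)

# Repeat each location row counts[i] times, then renumber the rows
persons = locations.loc[locations.index.repeat(counts)].reset_index(drop=True)
print(persons)
```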

### Populate Name, Gender, Occupation and Age of each person

In [41]:
# Populate intermediate dataframe
for index, row in df_persons.iterrows():
    
    # Populate name of the person with min_char_in_name to max_char_in_name (4 to 6) random letters
    allchar = string.ascii_letters
    df_persons.PersonName.iloc[[index]] = "".join(choice(allchar) for x in range(randint(min_char_in_name, max_char_in_name)))
    
    # Randomly populate Gender with a 50% probability each of male or female
    gender_rand = np.random.randint(2)
    if gender_rand == 0:
        df_persons.Gender.iloc[[index]] = "Male"
        # for male persons, likelihood to register is a fixed 40% plus a uniformly distributed random component of up to 60%
        df_persons.GenderLikelyness.iloc[[index]] = (0.4 + 0.6 * np.random.uniform(0, 1))
    else:
        df_persons.Gender.iloc[[index]] = "Female"
        # for female persons, likelihood to register is a fixed 60% plus a uniformly distributed random component of up to 40%
        df_persons.GenderLikelyness.iloc[[index]] = (0.6 + 0.4 * np.random.uniform(0, 1))

    # Age of the entire population follows a triangular distribution with low 18, mode 30 and high 67
    age = round(np.random.triangular(18, 30, 67))
    df_persons.Age.iloc[[index]] = age

    if (age>=18 and age<=25):
        # for persons of 18 to 25 yrs age, likelihood to register is a fixed 30% plus a uniform random component of up to 70%
        df_persons.AgeLikelyness.iloc[[index]] = (0.3 + 0.7 * np.random.uniform(0, 1))
    elif (age>25 and age<=40):
        # for persons of 26 to 40 yrs age, likelihood to register is a fixed 70% plus a uniform random component of up to 30%
        df_persons.AgeLikelyness.iloc[[index]] = (0.7 + 0.3 * np.random.uniform(0, 1))
    elif (age>40 and age<=67):
        # for persons of 41 to 67 yrs age, likelihood to register is a fixed 50% plus a uniform random component of up to 50%
        df_persons.AgeLikelyness.iloc[[index]] = (0.5 + 0.5 * np.random.uniform(0, 1))
    else:
        df_persons.AgeLikelyness.iloc[[index]] = 0

    # Assign occupation and occupation-based likelihood of each individual to join SSY
    occupn_rand = np.random.randn()
    if occupn_rand > 0.75:
        df_persons.WorkerType.iloc[[index]] = 'Transport'
        df_persons.WorkerTypeLikelyness.iloc[[index]] = (0.7 + 0.3 * np.random.uniform(0, 1))

    elif occupn_rand < -0.75:
        df_persons.WorkerType.iloc[[index]] = 'Construction'
        df_persons.WorkerTypeLikelyness.iloc[[index]] = (0.6 + 0.4 * np.random.uniform(0, 1))
    else:
        df_persons.WorkerType.iloc[[index]] = 'Other'
        df_persons.WorkerTypeLikelyness.iloc[[index]] = (0.4 + 0.6 * np.random.uniform(0, 1))


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
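
The warning above is raised by chained assignments such as `df_persons.Gender.iloc[[index]] = ...`, where a column is selected first and then assigned into. A minimal sketch of the single-call alternative follows, using a toy frame named `people` that is not part of this notebook.

```python
import pandas as pd

# Toy stand-in for df_persons (illustrative values only)
people = pd.DataFrame({"Gender": [None, None], "GenderLikelyness": [0.0, 0.0]})

# Chained form used in the notebook: people.Gender.iloc[[0]] = "Male"
# Single-call forms index the DataFrame itself, so pandas writes to the original object
people.loc[0, "Gender"] = "Male"                              # label-based
people.loc[0, "GenderLikelyness"] = 0.55
people.iloc[1, people.columns.get_loc("Gender")] = "Female"   # position-based

print(people)
```

Because `df_persons` is built with `ignore_index=True`, its labels and positions coincide, so `.loc[index, column]` would address the same cells the loop addresses with `.iloc`.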


In [42]:
for index, row in df_persons.iterrows():
    
    # final likelihood to register is the product of the location-based, age-related, gender-based and occupation-based likelihoods
    df_persons.Likelyness.iloc[[index]] = df_persons.LocationLikelyness.iloc[[index]] * df_persons.AgeLikelyness.iloc[[index]] * df_persons.GenderLikelyness.iloc[[index]] * df_persons.WorkerTypeLikelyness.iloc[[index]]
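
The same product can be taken over whole columns without a loop. This is a sketch on a toy frame with made-up values; only the column names are taken from the notebook.

```python
import pandas as pd

factor_cols = ["LocationLikelyness", "AgeLikelyness", "GenderLikelyness", "WorkerTypeLikelyness"]

# Two illustrative persons (values are invented for the example)
people = pd.DataFrame(
    [[0.5, 0.5, 0.5, 0.5],
     [1.0, 0.8, 0.5, 1.0]],
    columns=factor_cols,
)

# Row-wise product of the four factors
people["Likelyness"] = people[factor_cols].prod(axis=1)
print(people)
```

Since each factor lies between 0 and 1, the product is never larger than the smallest factor, which is why the final likelihood tends to be much lower than any single factor.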



# Calculate final likelihood of registering
### The AI has to learn this pattern through the training cycles
### Also determine whether the person was registered in previous years

In [45]:
for index, row in df_persons.iterrows():
    
    age = df_persons.Age.iloc[index]
    
    
    # Year - 2018: If final likelihood is > 25%, we consider the person registered for 2018
    if df_persons.Likelyness.iloc[index] > 0.25:
        df_persons.RegisteredYesNo.iloc[index] = 'Yes'
    else:
        df_persons.RegisteredYesNo.iloc[index] = 'No'

    # Prune retired age
    if age > 65:
        df_persons.RegisteredYesNo.iloc[index] = 'No'
        
        
    # Year - 2017: a person currently registered has a 75% likelihood of having been registered in 2017
    if df_persons.RegisteredYesNo.iloc[index] == 'Yes':
        if (np.random.uniform(0, 1)) < 0.75:
            df_persons.Registered2017.iloc[index] = 'Yes'
        else:
            df_persons.Registered2017.iloc[index] = 'No'
    # a person currently not registered has a 5% likelihood of having been registered in 2017
    else:
        if (np.random.uniform(0, 1)) > 0.05:
            df_persons.Registered2017.iloc[index] = 'No'
        else:
            df_persons.Registered2017.iloc[index] = 'Yes'            

    # Prune retired age and under age ('or', not 'and': no age can be both above 66 and below 19)
    if age>66 or age<19:
        df_persons.Registered2017.iloc[index] = 'No'


    # Year - 2016: a person registered in 2017 has a 75% likelihood of having been registered in 2016
    if df_persons.Registered2017.iloc[index] == 'Yes':
        if (np.random.uniform(0, 1)) < 0.75:
            df_persons.Registered2016.iloc[index] = 'Yes'
        else:
            df_persons.Registered2016.iloc[index] = 'No'
    # a person not registered in 2017 has a 5% likelihood of having been registered in 2016
    else:
        if (np.random.uniform(0, 1)) > 0.05:
            df_persons.Registered2016.iloc[index] = 'No'
        else:
            df_persons.Registered2016.iloc[index] = 'Yes'            

    # Prune under age
    if age<20:
        df_persons.Registered2016.iloc[index] = 'No'
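
The back-fill for 2017 and 2016 applies the same two-probability rule twice: 75% if registered in the following year, 5% otherwise. As an illustrative sketch, the rule can be isolated in one function and checked empirically. The name `previous_year_status` and its parameters are hypothetical, and the age pruning is left out.

```python
import random

def previous_year_status(registered_now, rng, p_if_yes=0.75, p_if_no=0.05):
    """Hypothetical helper: True if the person is simulated as registered in the previous year."""
    p = p_if_yes if registered_now else p_if_no
    return rng.random() < p

rng = random.Random(42)
trials = 20000

# Empirical share of previous-year registrations for each current status
share_if_yes = sum(previous_year_status(True, rng) for _ in range(trials)) / trials
share_if_no = sum(previous_year_status(False, rng) for _ in range(trials)) / trials
print(share_if_yes, share_if_no)
```

With this many trials the two printed shares should land close to 0.75 and 0.05 respectively.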




# Publish final dataset
## Print data output in csv file

In [46]:
# create final dataset
df_final = df_persons[['DistCode', 'District', 'Subdivision', 'PersonName', 'Gender', 'Age', 'WorkerType', 'RegisteredYesNo',
                          'Registered2017', 'Registered2016']]

In [47]:
df_final.head()

| | DistCode | District | Subdivision | PersonName | Gender | Age | WorkerType | RegisteredYesNo | Registered2017 | Registered2016 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AD | Alipurduar | Alipurduar | rDFa | Male | 35 | Other | Yes | Yes | Yes |
| 1 | AD | Alipurduar | Alipurduar | hbbfH | Male | 49 | Construction | Yes | Yes | Yes |
| 2 | AD | Alipurduar | Alipurduar | vXgEfn | Female | 50 | Other | Yes | Yes | Yes |
| 3 | BN | Bankura | Bankura Sadar | VtXVYH | Male | 43 | Construction | Yes | Yes | No |
| 4 | BN | Bankura | Bankura Sadar | airxsT | Female | 45 | Transport | Yes | No | No |


In [None]:
# publish final dataset into output csv file
df_final.to_csv('SubdivisionPersons.csv')
 
print("writing complete!")
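
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If that column is not wanted, `index=False` is one way to drop it. The sketch below uses a toy frame and in-memory buffers rather than the real output file.

```python
import io
import pandas as pd

# Toy stand-in for df_final (illustrative values only)
sample = pd.DataFrame({
    "Subdivision": ["Alipurduar", "Bankura Sadar"],
    "RegisteredYesNo": ["Yes", "No"],
})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```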

#### ---------------------------------------------------------------------------------------------------------------
#### || FluxionBits | SSY Analytics | 14-Aug-2018 | Anirban Chakrabarty ||
#### ---------------------------------------------------------------------------------------------------------------
#### Disclaimer:
This computer program is proprietary and confidential to FluxionBits.com. This is being used for marketing purposes to the Dept. of Planning, West Bengal Govt. only. Any illegal usage of this data or program in any form is punishable under the Indian Information Technology Act, 2000 and Copyright Act, 1957.

This uses computer simulation and public domain data and is completely free from any data infringement.
#### ---------------------------------------------------------------------------------------------------------------