# Univariate Lift Calculations
This notebook is a simple example of dynamic univariate lift calculations. Given a dataset of observations, we designate one column as the outcome and evaluate each remaining column as a univariate feature. For each feature, we identify its 20 most common values and calculate lift for each value.

To use this in your own analysis, you'll need a CSV data file with your own observations. Expect to edit the data loading and cleanup cells below, but the main analysis stage (identified below) is generic and should not need any editing.

In [None]:
import pandas as pd
import numpy as np

## Reading Input Data
This open dataset provided by Airbnb (http://data.insideairbnb.com/united-states/nc/asheville/2019-02-17/data/listings.csv.gz) contains detailed data for just over 2,000 rental listings. We use only a subset of the columns, with the review rating score as our outcome.

Note that the download is gzip-compressed; the cell below expects the unzipped `listings.csv`.
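
As an alternative to unzipping by hand, pandas can decompress gzip files on the fly: `read_csv` infers the compression from a `.gz` extension by default. The sketch below shows this on a tiny throwaway file. The file name `toy_listings.csv.gz` and its two columns are made up for illustration and are not the Airbnb data.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny gzip-compressed CSV to a temporary folder (illustrative data only)
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, 'toy_listings.csv.gz')
with gzip.open(gz_path, 'wt') as f:
    f.write('id,instant_bookable\n1,t\n2,f\n')

# compression='infer' is the default, so the .gz extension is enough
toy = pd.read_csv(gz_path, true_values=['t'], false_values=['f'])
print(toy)
```

If this approach suits your setup, pointing `input_filepath` at the `.csv.gz` download should work the same way.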

In [None]:
input_filepath = 'c:/users/rbagley/downloads/listings.csv'
dataset = pd.read_csv(input_filepath, header=0, true_values=['t'], false_values=['f'],
    usecols=[
        'id', 'host_location', 'host_response_time', 'host_is_superhost','host_listings_count', 
        'host_has_profile_pic', 'host_identity_verified', 'neighbourhood_cleansed',
        'zipcode', 'is_location_exact', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms',
        'beds', 'bed_type', 'price', 'minimum_nights', 'maximum_nights',
        'review_scores_rating', 'instant_bookable', 'is_business_travel_ready',
        'cancellation_policy', 'require_guest_profile_picture', 'require_guest_phone_verification']
    )

## Data Cleanup
* For ease of analysis, reduce the review score from a 0-100 range to a boolean: any score of 95 or higher is considered a positive rating, and a missing score counts as not positive
* Bucket the price values in $50 increments
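
The two cleanup steps can also be written as vectorized pandas operations instead of row-wise `apply` calls. The sketch below uses a small made-up frame (`toy`) with the same two column names. It assumes prices are strings such as `$1,250.00` and are never negative, so floor division gives the same bucket as truncation.

```python
import pandas as pd

# Made-up sample rows, for illustration only
toy = pd.DataFrame({
    'review_scores_rating': [98.0, 95.0, 90.0, None],
    'price': ['$49.00', '$125.00', '$1,250.00', '$50.00'],
})

# Comparing NaN against a number yields False, so unrated rows are not positive
toy['positive_rating'] = toy['review_scores_rating'] >= 95

# Strip '$' and ',' then round down to the lower edge of a $50 bucket
price = (toy['price']
         .str.replace('$', '', regex=False)
         .str.replace(',', '', regex=False)
         .astype(float))
toy['price_group'] = (price // 50 * 50).astype(int)

print(toy[['positive_rating', 'price_group']])
```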

In [None]:
def is_positive_rating(x):
    # A score of 95 or higher is positive; a missing score (NaN) compares as False
    return x >= 95

dataset['positive_rating'] = dataset['review_scores_rating'].apply(is_positive_rating)
dataset = dataset.drop(columns=['review_scores_rating'])

In [None]:
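def create_price_group(x):
    # Strip the currency formatting, e.g. '$1,250.00' -> 1250.0
    price_value = float(str(x).replace('$','').replace(',',''))
    # Round down to the lower edge of the $50 bucket, e.g. 125.0 -> 100
    price_group = int(price_value / 50) * 50
    return price_group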

dataset['price_group'] = dataset['price'].apply(create_price_group)
dataset = dataset.drop(columns=['price'])

In [None]:
dataset.columns

## Lift Analysis
Now we iterate over each feature column to count, for each value, the total rows and the rows with a positive outcome. From these counts we compute the probability of a positive outcome per value and then the lift: that probability divided by the overall probability of a positive outcome.
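
To make the arithmetic concrete, here is a minimal sketch on a made-up six-row dataset (the `room_type` values and outcomes are invented for illustration). A lift above 1 means the value is associated with more positive outcomes than average, a lift below 1 with fewer, and a value with no positive outcomes gets a lift of 0.

```python
import pandas as pd

# Made-up observations, for illustration only
toy = pd.DataFrame({
    'room_type': ['Entire home', 'Entire home', 'Entire home', 'Entire home',
                  'Private room', 'Private room'],
    'positive_rating': [True, True, True, False, False, False],
})

# Overall probability of a positive outcome: 3 of 6 rows
overall_prob = toy['positive_rating'].mean()

# Per-value row count and positive count
toy_lift = toy.groupby('room_type')['positive_rating'].agg(count='size', positives='sum')

# Per-value probability, then lift relative to the overall probability
toy_lift['prob'] = toy_lift['positives'] / toy_lift['count']
toy_lift['lift'] = toy_lift['prob'] / overall_prob

print(toy_lift)
```

Here `Entire home` has a probability of 3/4 = 0.75 against an overall 0.5, giving a lift of 1.5.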

In [None]:
outcome_col = 'positive_rating'
id_col = 'id'
feature_cols = ['host_location', 'host_response_time', 'host_is_superhost', 'host_listings_count', 'host_has_profile_pic', 
                'host_identity_verified', 'neighbourhood_cleansed', 'zipcode', 'is_location_exact', 'property_type', 
                'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'minimum_nights', 'maximum_nights', 
                'instant_bookable', 'is_business_travel_ready', 'cancellation_policy', 'require_guest_profile_picture', 
                'require_guest_phone_verification', 'price_group']


**NOTE!!!** Everything in this next code block is completely generic. No edits should be required!

In [None]:

# Need some global counts
total_count = len(dataset.index)
total_positives = len(dataset[(dataset[outcome_col] == True)])

# An empty list to hold the count results
count_list = list()

# Iterate over each feature to collect individual counts per value
for feature in feature_cols:
    # rows per value, keeping only the 20 most common values
    counts_df = dataset.groupby(feature)[id_col].count().nlargest(20).to_frame('count')
    # positive outcomes per value (all values)
    positives_df = dataset[dataset[outcome_col] == True].groupby(feature)[id_col].count().to_frame('positives')
    # left merge so that common values with no positive outcomes are kept, with 0 positives
    merge_df = counts_df.merge(positives_df, how='left', left_index=True, right_index=True)
    merge_df['positives'] = merge_df['positives'].fillna(0).astype(int)
    # iterate over rows, building a dict per row, and append each to list of counts
    for value, row in merge_df.iterrows():
        this_dict = {
            'feature': feature,
            'value': str(value),
            'total_count': total_count,
            'total_positives': total_positives,
            'count': row['count'],
            'positives': row['positives']
        }
        count_list.append(this_dict)

# Create a new dataframe from the aggregated list
lift_df = pd.DataFrame(count_list)

# Now let's add some calculations for probabilities and lift per row
lift_df['total_prob'] = lift_df['total_positives'] / lift_df['total_count']
lift_df['prob'] = lift_df['positives'] / lift_df['count']
lift_df['lift'] = lift_df['prob'] / lift_df['total_prob']
lift_df['1/lift'] = 1 / lift_df['lift']
lift_df['prct_total'] = lift_df['count'] / lift_df['total_count']

## Output CSV
Write the final lift dataframe to a CSV file for use elsewhere.

In [None]:
output_filepath = 'c:/users/rbagley/downloads/lift_output.csv'

lift_df.to_csv(output_filepath, index=False)

&copy; Mackinac Data Group, 2019