<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       A Study of Car Complaints Data using Geospatial Analysis and Outlier Detection
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Customer complaints are often tricky to handle because the relevant data sources (complaint text, complaint geolocation, service centre locations, etc.) are rarely used together. This demo highlights Vantage features that address this problem, with assistance from graphics libraries in Python. <br>
The demo aims to give the business user a fuller view of their customers and to show where to focus by highlighting cases that need special attention.
<br>Key benefits of this kind of analysis:
    <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
        <li>Better understanding of the current status through visual representation</li>
<li>Easy implementation meant to scale as it leverages Vantage functionalities</li>
<li>Unconventional usage of Vantage functions (geospatial for attribution, multi-level outlier detection)</li>   </ul>
<br>    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Data is sourced from a public database by the National Highway Traffic Safety Administration (NHTSA) of the USA with a few modifications to analyse data on a state-county level.
  <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>1,000 records for cars made by Ford Motor Company were randomly sampled</li>
    <li>Each complaint was assigned a geolocation (latitude, longitude) in Iowa (to simulate data coming from a single state)</li><li>Records were limited to 2019</li></ul>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will perform two kinds of analysis:
      <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Service Center Analysis</b> - A geospatial attribution of complaints to their nearby service centres, plus a county-specific ranking to search for blind spots.</li>
    <li><b>Defect Analysis</b> - Defect outlier detection to spot complaints about parts that failed earlier than expected.</li>
    </ul>

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Import python packages, connect to Vantage and explore the dataset</b></p>

In [None]:
#import libraries
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
import geopandas as gpd
import matplotlib.pyplot as plt 
import pandas as pd
import seaborn as sns
import getpass

from teradataml import *
import plotly.express as px
import json
from pandas import json_normalize
import numpy as np
# import plotly.express as px
from  ipywidgets import widgets, interact

display.max_rows = 5 


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, then <b>use down arrow</b> to go to next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Car_Complaints_PY_SQL.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. You have the option of either running the demo using foreign tables to access the data without using any storage on your environment or downloading the data to local storage which may yield somewhat faster execution, but there could be considerations of available storage. There are two statements in the following cell, and one is commented out. You may switch which mode you choose by changing the comment string.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_Car_cloud');"
 # takes about 30 seconds, estimated space: 0 MB
%run -i ../run_procedure.py "call get_data('DEMO_Car_local');" 
# takes about 1min 40seconds, estimated space: 1.5 MB

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next is an optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Initial Data Sets</b></p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us take a look at the source tables we have.</p>

In [None]:
df1 = DataFrame(in_schema("DEMO_Car", "Complaint_Locations"))
df1

In [None]:
df1.shape

In [None]:
df1.dtypes

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the output above we can see that the <i>Complaint_Locations</i> table holds the complaint ID and the geolocation from which each complaint was raised. It has 1,000 records.</p>

In [None]:
df2 = DataFrame(in_schema("DEMO_Car", "Service_Centers"))
df2

In [None]:
df2.shape

In [None]:
df2.dtypes

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the output above we can see that the <i>Service_Centers</i> table holds information about each service center. It covers 138 service centers.</p>

In [None]:
df3 = DataFrame(in_schema("DEMO_Car", "Complaints"))
df3

In [None]:
df3.shape

In [None]:
df3.dtypes

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <i>Complaints</i> table holds the details of each registered complaint.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Additionally, we use the Iowa county boundary data from https://geodata.iowa.gov/datasets/iowa::iowa-county-boundaries

In [None]:
df4 = DataFrame(in_schema("DEMO_Car", "Counties"))
df4

In [None]:
df4.shape

In [None]:
df4.dtypes

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Service Center Analysis
 </b></p>
 

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'> 
        <li>Measure service center workload based on customer demand and location. For this we calculate workload as an attribution score per service center, using geospatial functions on the distances to customer locations.</li>
<li>Identify problem areas or service blind spots based on population, customer demand, and distances between customers and centres. For this we calculate an aggregated performance score per area unit based on multiple metrics. </li> </ul>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'> The benefit of this kind of analysis is that clients can use the attribution scores to allocate resources and even out the workload among service centers. It can also help them choose new store locations.
<p style = 'font-size:18px;font-family:Arial;color:#00233C'>Distance-based Attribution
<p style = 'font-size:16px;font-family:Arial;color:#00233C'> The typical approach to attribution uses the
    <b>Attribution function in Vantage</b>:
    <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>     
        <li>It works on row-wise time and transaction-based data</li>
<li>Credit is distributed to events based on time, using models such as<ul style = 'font-size:16px;font-family:Arial;color:#00233C'> 
        <li>FIRST_CLICK</li>
        <li>LAST_CLICK</li>
        <li>UNIFORM</li>
    <li>EXPONENTIAL</li></ul>
</li>
    </ul>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In our case we use distance-based attribution: the credit for each complaint is split equally among the service centres within X km of it (15 km in the query below). If no service center lies within that radius, the nearest service center receives the full credit.

<img id="distance_attrib" src="distance_attrib.PNG" alt="Attribution score" width="800" />
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Distance based attribution.</p>
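<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The attribution rule described above can be sketched in plain Python. This is a minimal illustration and not the Vantage implementation: the function names <code>haversine_km</code> and <code>distance_attribution</code>, the sample coordinates, and the haversine formula on a sphere of radius 6371 km are all assumptions made for this example.</p>

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def distance_attribution(complaints, centers, radius_km=15.0):
    """Split each complaint's credit (1.0) among the centers within radius_km.

    If no center is within the radius, the nearest center gets the full credit.
    """
    scores = {sc_id: 0.0 for sc_id in centers}
    for lat, lon in complaints.values():
        dists = {sc_id: haversine_km(lat, lon, s_lat, s_lon)
                 for sc_id, (s_lat, s_lon) in centers.items()}
        in_range = [sc_id for sc_id, d in dists.items() if d <= radius_km]
        if in_range:
            share = 1.0 / len(in_range)
            for sc_id in in_range:
                scores[sc_id] += share
        else:
            scores[min(dists, key=dists.get)] += 1.0
    return scores


# Made-up coordinates (lat, lon) for illustration only
centers = {"A": (41.60, -93.60), "B": (41.65, -93.60), "C": (42.50, -92.33)}
complaints = {"c1": (41.62, -93.60),   # within 15 km of A and B
              "c2": (42.00, -91.65)}   # no center within 15 km; C is nearest

scores = distance_attribution(complaints, centers)
print(scores)  # {'A': 0.5, 'B': 0.5, 'C': 1.0}
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because every complaint contributes exactly one unit of credit, the scores always sum to the number of complaints, which makes them comparable across service centers.</p>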

<p style = 'font-size:16px;font-family:Arial;color:#00233C'> We use the <b>ST_Geometry</b> data type in Vantage to load and work with the geographic data. The coordinates (latitude and longitude) of the service centers and complaint locations are loaded as ST_Point, and the Iowa county boundaries are loaded as ST_Polygon.<br>We use the <b>ST_SphericalDistance</b> function to compute the distance between each complaint location and each service center, which lets us find the nearest service center to each complaint, and <b>ST_Within</b> to determine which county (ST_Polygon) a given complaint (ST_Point) belongs to. 
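<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Conceptually, a within test asks whether a point lies inside a polygon. The sketch below shows one common way to answer that question, the ray-casting method. It illustrates the idea only: the notebook does not say how Vantage evaluates ST_Within, the handling of points exactly on a boundary may differ, and the function name <code>point_in_polygon</code> and the rectangular "county" are hypothetical.</p>

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) is strictly inside the polygon.

    polygon is a list of (x, y) vertices in order; the last vertex is
    implicitly joined back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # one more crossing to the right of the point
    return inside


# Hypothetical rectangular county given as (longitude, latitude) vertices
county = [(-94.0, 41.0), (-93.0, 41.0), (-93.0, 42.0), (-94.0, 42.0)]

print(point_in_polygon(-93.5, 41.5, county))  # True
print(point_in_polygon(-92.5, 41.5, county))  # False
```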

In [None]:
%%capture
execute_sql('''
REPLACE VIEW sc_attribution AS
 -- distances between service centers and complaints
WITH distances as (
    SELECT b.service_center_id as sc_id,
        b.geometry as sc_geom,
        a.cmplid as comp_id,
        b.start_yearmonth,
        b.end_yearmonth,
        a.geometry.ST_SphericalDistance(b.geometry)/1000 as dist --distance in kilometers
    FROM DEMO_CAR.complaint_locations a, DEMO_CAR.service_centers b),
-- nearest service centers to each complaint
nearest as (
    SELECT sc_id,
        sc_geom,
        comp_id, 
        start_yearmonth,
        end_yearmonth,
        dist
    FROM distances
    QUALIFY ROW_NUMBER() OVER(PARTITION BY comp_id ORDER BY DIST) = 1)
-- table of service centers and aggregated attribution scores 

    SELECT sc_id service_center_id,
        'Station '||CAST(sc_id AS CHAR(3)) as service_center,
        sc_geom.ST_Y() as lat,
        sc_geom.ST_X() as "long",
        start_yearmonth,
        end_yearmonth,
        SUM(attrib_score) as attribution_score, -- total attribution score is the sum of attribution scores across complaints 
        RANK() OVER (ORDER BY attribution_score DESC) as attribution_score_rank
    FROM (
        SELECT comp_id, 
            sc_id,
            sc_geom,
            start_yearmonth,
            end_yearmonth,
            1.000/ (COUNT(sc_id) OVER (PARTITION BY comp_id)) as attrib_score
        FROM distances
        WHERE dist <= 15 -- service centers within the 15 km radius (boundary included)
    
        UNION ALL
    
        SELECT comp_id, 
            sc_id,
            sc_geom,
            start_yearmonth,
            end_yearmonth,
            1
        FROM nearest
        WHERE dist > 15
    
        UNION ALL
    
        SELECT comp_id, 
            sc_id,
            sc_geom,
            start_yearmonth,
            end_yearmonth,
            0
        FROM distances
        WHERE dist > 15
        ) AS attrib_scores
    GROUP BY 1,2,3,4,5,6
; ''')    

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's see what the data in this view looks like.

In [None]:
sc = DataFrame("sc_attribution")
sc

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now let us plot this data to see how the service center attribution changes over time.<br><i>*Please click the play button to start the animation.</i>

In [None]:
qry = ''' 
select  service_center_id,service_center,lat,"long",start_yearmonth,attribution_score
from sc_attribution group by 1,2,3,4,5,6 ;
'''

out1= DataFrame.from_query(qry)
df=out1.to_pandas()
fig1 = px.scatter_mapbox(df,lat="lat", lon="long", hover_name="service_center",size=pd.to_numeric(df['attribution_score']),
                         color="attribution_score", size_max=70, zoom=6, 
                         animation_frame="start_yearmonth",
                         category_orders={"start_yearmonth": [201801,201802,201803,201804,201805,201806,201807,201808,
                                                              201809,201810, 201811, 201812,201901, 201902, 201903,201904, 
                                                              201905, 201906, 201907,201908, 201909,201910, 201911, 201912]}, 
                         color_continuous_scale=px.colors.sequential.Bluered, 
                         height = 600
                  )
fig1.update_layout(mapbox_style="open-street-map")
fig1.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig1.update_layout(title_text = 'Service_center ranks over the years' ,title_y=1)
fig1.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the map above we can see the attribution score of each service center for each period (<i>start_yearmonth</i>), based on the calculations in the view created earlier.
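<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The frame order for the animation (the <code>category_orders</code> argument) lists every month from 201801 to 201912. A hand-typed list like this is easy to get wrong, so one option is to generate it. The sketch below assumes that <code>start_yearmonth</code> holds integers in YYYYMM form; the helper name <code>yearmonth_range</code> is hypothetical.</p>

```python
def yearmonth_range(start_year, end_year):
    """Return YYYYMM integers for every month from January of start_year
    to December of end_year, in chronological order."""
    return [year * 100 + month
            for year in range(start_year, end_year + 1)
            for month in range(1, 13)]


frames = yearmonth_range(2018, 2019)
print(frames[:3], frames[-1], len(frames))  # [201801, 201802, 201803] 201912 24
```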

<p style = 'font-size:18px;font-family:Arial;color:#00233C'>Area Rankings 
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Each area is ranked on multiple metrics, and these rankings are combined to score the area.
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Population </li>
    <li>No. Of Customers/ Complaints</li>
    <li>No. Of Non-Covered Customers/ Complaints</li>
    <li>Percent of Non-Covered Customers/ Complaints</li>
    </ul>
    </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have created a view on the complaint, service center, and Iowa county data to compute these ranks.
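<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the ranking idea concrete, here is a small plain-Python sketch. The helper <code>sql_rank_desc</code> mimics a descending SQL <code>RANK()</code> (ties share a rank and leave a gap). The county figures are made up, every county is assumed to have at least one customer, and combining the four ranks by taking their mean is an assumption for illustration, since the text does not say how the rankings are combined.</p>

```python
def sql_rank_desc(values):
    """Rank values like SQL RANK() OVER (ORDER BY v DESC): ties share a rank."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


# Made-up figures for three hypothetical counties
county_metrics = {
    "County A": {"population": 490000, "customers": 40, "noncovered": 4},
    "County B": {"population": 226000, "customers": 20, "noncovered": 10},
    "County C": {"population": 7000, "customers": 5, "noncovered": 5},
}

names = list(county_metrics)
rows = [county_metrics[name] for name in names]

rank_columns = [
    sql_rank_desc([r["population"] for r in rows]),
    sql_rank_desc([r["customers"] for r in rows]),
    sql_rank_desc([r["noncovered"] for r in rows]),
    # Share of customers whose nearest service center is too far away
    sql_rank_desc([r["noncovered"] / r["customers"] for r in rows]),
]

# Assumption: the combined score is the mean of the four ranks (lower = higher priority)
combined = {name: sum(col[i] for col in rank_columns) / len(rank_columns)
            for i, name in enumerate(names)}
print(combined)  # {'County A': 2.0, 'County B': 1.75, 'County C': 2.25}
```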

In [None]:
%%capture
execute_sql('''
REPLACE VIEW county_accessibility AS 
-- table of service centers nearest to each complaint
    WITH nearest as (
    SELECT a.cmplid,
        a.geometry as cmpl_geom,
        b.service_center_id as sc_id,
        b.geometry as sc_geom,
        a.geometry.ST_SphericalDistance(b.geometry)/1000 as dist
    FROM demo_car.complaint_locations as a, demo_car.service_centers as b
    QUALIFY ROW_NUMBER() OVER(PARTITION BY cmplid ORDER BY dist) = 1
    )

    SELECT county_id,
        county_name,
        population,
        RANK() OVER(ORDER BY population DESC) as population_rank,
        -- number of complaints per county whose nearest service center is more than x distance away
        COUNT(
            CASE WHEN (dist > 15) AND a.cmpl_geom.ST_Within(b.geometry) = 1 THEN cmplid
                ELSE NULL
            END) as noncovered_customers,
        RANK() OVER(ORDER BY noncovered_customers DESC) as noncovered_customers_rank,
        COUNT(
            CASE WHEN a.cmpl_geom.ST_Within(b.geometry) = 1 THEN cmplid
                ELSE NULL
            END) as customers,
        RANK() OVER(ORDER BY customers DESC) as customers_rank,
        CASE WHEN customers = 0 THEN NULL
            ELSE (noncovered_customers*1.0000)/(customers*1.0000) 
        END as noncovered_customers_pct,
        COALESCE(noncovered_customers_pct,0) AS noncovered_customers_pct2,        
        CASE WHEN noncovered_customers_pct IS NULL THEN NULL 
            ELSE RANK() OVER(ORDER BY noncovered_customers_pct DESC) 
        END as noncovered_customers_pct_rank
    FROM nearest a,
        demo_car.counties b
        where a.cmpl_geom.st_x() between geo_mbr.xmin() and geo_mbr.xmax() 
        and a.cmpl_geom.st_y() between geo_mbr.ymin() and geo_mbr.ymax()
    GROUP BY 1,2,3
; ''')    

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><i>* The view query involves complex geometry calculations, so the step below takes approximately 1 minute to run.</i>

In [None]:
df = DataFrame.from_query('''select county_id,
    county_name,
    population_rank,
    noncovered_customers_rank,
    customers_rank,
    noncovered_customers_pct_rank
    from county_accessibility;''')
ca=df.to_pandas()
ca.head(5)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us visualize this data

In [None]:
# Read the GeoJSON file to get the boundary information of the counties
with open ("./data/Iowa_County_Boundaries.geojson",'r') as infile:
    counties = json.load(infile)

In [None]:
def plot_map(candidate):    
    # Build a county choropleth coloured by the selected rank column
    fig = px.choropleth_mapbox(
        ca, geojson=counties, color=candidate,
        locations="county_id", featureidkey="properties.FIPS",
        center={"lat": 42.032974, "lon": -93.581543},
                           mapbox_style="open-street-map",
                           zoom=5)
    fig.update_geos(fitbounds="locations", visible=False)
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    return fig

# Create the dropdown widget
candidates = ["population_rank","noncovered_customers_rank", "customers_rank","noncovered_customers_pct_rank"]
candidate_dropdown = widgets.Dropdown(options=candidates, description='Candidate:', value='population_rank')

# Redraw the map with plot_map whenever the dropdown selection changes
def update_plot(candidate):    
    plot_map(candidate).show()
    
widgets.interact(update_plot, candidate=candidate_dropdown)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><i>*Please note that the map takes a few seconds to refresh after the dropdown selection changes.</i><br>The map above shades each county according to the rank selected in the dropdown.

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Defect Analysis
 </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'> In this demo we look for complaints in which a car part developed a defect unusually soon after the date of purchase compared with other complaints. In other words, we look for outliers or anomalies in the complaint data. Put simply, an outlier is a data point that differs significantly from the other observations.<br>The main benefits of this type of analysis are: <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Car insurance providers or manufacturers can detect suspicious car defect complaints.</li>
    <li>Car companies can identify faulty car models and car parts.</li>
    <li>Car manufacturers can see which models attract numerous early-defect complaints and where to improve them.</li></ul>
        <p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use Vantage's TD_OutlierFilterFit and TD_OutlierFilterTransform functions to find the outliers in the data and analyse them.

In [None]:
#we have pulled the complaints table earlier we'll use that dataframe 
com=df3.to_pandas()
com.head()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Let us plot a boxplot on the complaints data. A box plot is a graphical rendition of statistical data based on the minimum, first quartile, median, third quartile, and maximum. The term <b>box plot</b> comes from the fact that the graph looks like a rectangle with lines extending from the top and bottom.

In [None]:
plt.figure(figsize=(20,8))
# Rotate the x tick labels so the car-part names stay readable
plt.tick_params(axis='x', which='major', labelsize=14, rotation=90)
plt.xlabel('car_part', fontsize=16);
plt.ylabel('days_to_defect', fontsize=16);
plt.title('Box plot of the data by car part', fontsize=20)
ax = sns.boxplot(x = 'car_part', y = 'days_to_defect', data = com)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As we have many car parts in our dataset, the plot above is crowded. Let us select a few car parts and plot them again to get a clearer view.

In [None]:
options = ['STEERING' ,'ENGINE','STRUCTURE','POWER TRAIN'] 

# Plot only the selected car parts
plt.figure(figsize=(20,8))
# Rotate the x tick labels so the car-part names stay readable
plt.tick_params(axis='x', which='major', labelsize=14, rotation=90)
plt.xlabel('car_part', fontsize=16);
plt.ylabel('days_to_defect', fontsize=16);
plt.title('Box plot of the data by car part', fontsize=20)
ax = sns.boxplot(x = 'car_part', y = 'days_to_defect', data = com[com['car_part'].isin(options)])

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The middle line inside the box is the median (Q2) of the data, and the bottom and top edges of the box are the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile). The lower limit equals Q1 - 1.5 * (Q3-Q1) and the upper limit equals Q3 + 1.5 * (Q3-Q1). Any points that lie beyond these limits are considered outliers.

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Now let us try to find the outliers in the data. <b>OutlierFilterFit</b> from the teradataml library calculates the lower_percentile, upper_percentile, count of rows and median for the specified input table columns. The calculated values are passed to the <b>OutlierFilterTransform</b> function to filter out the outliers from the dataset. We are using the Tukey method, which keeps values in the range [Q1 - k*(Q3-Q1), Q3 + k*(Q3-Q1)], where k is the interquartile range multiplier. The other available methods are Percentile and Carling. Please refer to the documentation for a full listing of parameters and return values.
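<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Tukey rule can be tried out on a handful of numbers using only the Python standard library. This is a sketch of the rule itself, not of OutlierFilterFit: the helper names and the sample values are made up, and the quartiles come from <code>statistics.quantiles</code> with the "inclusive" method, which may not match the percentile calculation Vantage uses.</p>

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, k=1.5, tail="both"):
    """Split values into (kept, removed); tail is 'lower', 'upper' or 'both'."""
    lower, upper = tukey_fences(values, k)
    kept, removed = [], []
    for v in values:
        too_low = tail in ("lower", "both") and v < lower
        too_high = tail in ("upper", "both") and v > upper
        (removed if too_low or too_high else kept).append(v)
    return kept, removed


# Made-up days_to_defect values for a single car part
days_to_defect = [10, 200, 220, 240, 260, 280, 300, 900]

print(tukey_fences(days_to_defect))                      # (110.0, 390.0)
print(split_outliers(days_to_defect, tail="lower")[1])   # [10]
print(split_outliers(days_to_defect, tail="both")[1])    # [10, 900]
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A smaller multiplier k pulls the limits closer to the quartiles, so more values fall outside them. Restricting the test to the lower tail keeps only the unusually early defects.</p>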

In [None]:
from teradataml import OutlierFilterFit, OutlierFilterTransform

fit_obj = OutlierFilterFit(data=df3,
                               target_columns="days_to_defect",
                               outlier_method="TUKEY",
                               replacement_value="DELETE",
                               iqr_multiplier=0.1,
                               remove_tail="LOWER",
                               group_columns="car_part")

In [None]:
fit_obj.result.sample(n = 5)

In [None]:
obj = OutlierFilterTransform(data=df3,data_partition_column="car_part",
                             object=fit_obj.result,
                             object_partition_column="car_part")

In [None]:
df5=(obj.result).to_pandas()
df5.head(5)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'> Our source input dataframe had 1000 records, whereas the transformed dataframe has 829 records. Let us subtract the transformed dataframe from the source dataframe to keep only the records that were marked as outliers based on our input parameters.
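<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The subtraction is a set difference: rows present in the source but absent from the filtered result. The sketch below shows that idea on plain Python tuples, returning distinct rows only, which is how a MINUS without duplicates is expected to behave. The helper name <code>minus_distinct</code> and the sample rows are hypothetical.</p>

```python
def minus_distinct(left, right):
    """Return the distinct rows of left that do not appear in right, keeping left's order."""
    right_set = set(right)
    seen = set()
    result = []
    for row in left:
        if row not in right_set and row not in seen:
            seen.add(row)
            result.append(row)
    return result


# Made-up rows: (complaint id, car part, days to defect)
source_rows = [(1, "STEERING", 12), (2, "ENGINE", 400), (3, "ENGINE", 5), (2, "ENGINE", 400)]
kept_rows = [(2, "ENGINE", 400)]  # rows that survived the outlier filter

outlier_rows = minus_distinct(source_rows, kept_rows)
print(outlier_rows)  # [(1, 'STEERING', 12), (3, 'ENGINE', 5)]
```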

In [None]:
from teradataml.dataframe.setop import td_minus

In [None]:
idf = td_minus([df3, obj.result], allow_duplicates=False)
df6=idf.to_pandas()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us plot this data to see which car parts have the highest number of outlier complaints.

In [None]:
plt.figure(figsize=(20,8))
plt.tick_params(axis='x', which='major', labelsize=14, rotation=90)
ax = sns.countplot(x="car_part",data=df6)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As we can see from the plot above, steering and power train have the largest number of outlier complaints. A similar analysis can be done on car models.

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Cleanup</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_Car');" 
#Takes 10 seconds

In [None]:
remove_context()

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As this demo shows, we can gain valuable insights from our data when we augment it with geographical attributes. We have also seen that anomalies occur in the data. They may or may not be a cause for concern, but analysing them can show how a business can improve its processes or redirect resources to where they are needed.

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Reference Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'> 
       <li>Teradata Package for Python Function Reference: <a href = 'https://docs.teradata.com/r/Enterprise/Teradata-Package-for-Python-Function-Reference-17.20/Teradata-Package-for-Python-Function-Reference'>https://docs.teradata.com/r/Enterprise/Teradata-Package-for-Python-Function-Reference-17.20/Teradata-Package-for-Python-Function-Reference</a></li>
</ul>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023,2024. All Rights Reserved
        </div>
    </div>
</footer>