<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Time Series Analysis - Analyzing Consumer Complaints Over Time
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
In this example we will be analyzing the number of complaints over time received by the Consumer Financial Protection Bureau (CFPB).</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
How can we use Vantage to extract insights and tell a story behind a dataset? In this use case, you will see how powerful and simple it is to extract answers from a public dataset available through <a href="http://data.gov">Data.gov</a>. We use SQL and a visualization tool to analyze the number of complaints over time to answer the following questions:</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
 <i>What are the trends of complaints over time? How can we interpret the outliers in the dataset?</i>
</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>    
By answering questions like the ones above, we gain a deeper understanding of the dataset, and we can explain in plain language how the number of complaints evolves over time. In the Explore section, we focus on analyzing the number of complaints over time and identifying trends and outliers in the time series to answer the questions above.
</p>    

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Connect to Vantage</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press Enter, then use the down arrow to go to the next cell.</p>

In [None]:
%connect local, hidewarnings=true

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Setup for execution of notebook. Begin running steps with Shift + Enter keys.</p>

In [None]:
Set query_band='DEMO=TimeSeriesAnalysis.ipynb;' update for session;

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. Because this demo uses a temporal table, we create the databases and tables in local storage and use them in the notebook. Please execute the procedure in the next cell.</p>

In [None]:
call get_data('DEMO_Financial_cloud');    -- takes about 50 seconds, estimated space: 0 MB
--call get_data('DEMO_Financial_local');     -- takes about 4 minutes, estimated space: 300 MB

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
call space_report();  -- optional, takes about 10 seconds

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Querying the Data</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have made our connection to the Vantage system; now let's start exploring the data. We'll start by counting the number of rows in the table.</p>

In [None]:
select count(*) from DEMO_Financial.Consumer_complaints;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
There are just under 1.3 million rows. Analyzing datasets of this size is not a problem in Vantage. Let's take a look at a sample of the data.</p>

In [None]:
select TOP 5 * from DEMO_Financial.Consumer_Complaints;

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Visualizing the Data</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
From the query above, we notice that this dataset has a lot of information. To derive some insights, we need to start grouping the data.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The first column is <b>date_received</b>. This is the date the complaints were received, and it means that we can look at a time series plot of the data. Let's start by grouping the counts of <b>complaint_id</b> over time, using <b>date_received</b> as our time axis.</p>

In [None]:
select date_received, count(complaint_id) as counts
from DEMO_Financial.Consumer_Complaints
group by date_received;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
This is great; we now have the number of complaints (<b>counts</b>) by time (<b>date_received</b>), but how do we make sense of this data? Let's plot this time series on a graph.</p>

In [None]:
%chart date_received, counts, title='Number of Complaints over Time', width=900, height=400

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
By visualizing the data above, we can see that the number of complaints varies a lot over time and there also seem to be more complaints as time progresses. There are also some unusual spikes in 2017. Let's understand more about our data. We start by looking at the general trend.
<br>
<br>
Let's group the data by month and re-plot the graph above.</p>

In [None]:
select extract(year from date_received) || extract(month from date_received) as month_date, count(complaint_id) as counts
from DEMO_Financial.Consumer_Complaints
group by month_date
order by month_date;

In [None]:
%chart month_date, counts, title='Number of Complaints by Month and Year', width=900, height=400, mark=line

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Looking at complaints by month and year, we see a clear upward trend. One hypothesis is that, as time progresses, more people become aware of the complaint channels and spread the word. The media may also publicize these channels over time. This chart also shows that the spikes we saw above occurred in January 2017 and September 2017. Let's dive deeper into these dates and draw some insights in the next step.</p>
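<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The monthly query above builds <b>month_date</b> by concatenating two extracted integers, so the label has no separator between year and month. As a minimal sketch, and assuming <b>TO_CHAR</b> with a <b>'YYYY-MM'</b> format is available on your Vantage system, the same grouping can be keyed on a zero-padded label that sorts chronologically as plain text. The alias <b>month_label</b> is a name chosen here for illustration.</p>

```sql
-- Sketch only: same monthly aggregation as above, keyed on a 'YYYY-MM' label.
-- Assumes TO_CHAR accepts a DATE and the 'YYYY-MM' format string.
-- month_label is an illustrative alias, not part of the original notebook.
select to_char(date_received, 'YYYY-MM') as month_label,
    count(complaint_id) as counts
from DEMO_Financial.Consumer_Complaints
group by 1
order by 1;
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>If you run this variant, chart it with <b>month_label</b> in place of <b>month_date</b>.</p>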

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Extracting Insights from the Data</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Let's narrow down the two spikes above and see exactly where they are happening. We can do this by plotting another time series plot, this time only in 2017.</p>

In [None]:
select date_received, count(complaint_id) as counts
from DEMO_Financial.Consumer_Complaints
where year(date_received) = 2017
group by date_received
order by date_received;

In [None]:
%chart date_received, counts, title='Complaints over time - 2017', width=900, height=400, mark=line

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Looking at the peaks, we find that they occurred from January 15th to 21st and during the first week of September. To find the actual dates of the peaks, we can restrict the query to days in January and September 2017 that received at least 1,500 complaints.</p>

In [None]:
select date_received,
    month(date_received) as month_date,
    count(complaint_id) as counts
from DEMO_Financial.Consumer_Complaints
where year(date_received) = 2017 and month_date in (1, 9)
group by date_received
having counts >= 1500
order by month_date, counts desc;
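<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The fixed cutoff of 1,500 in the query above was read off the chart. A data-driven alternative is to flag days whose count lies well above the yearly average. The sketch below is not part of the original analysis: the three-standard-deviation threshold is an assumption you may want to tune, <b>daily</b> is an illustrative name, and it relies on window aggregates with an empty <b>OVER ()</b> clause and on <b>QUALIFY</b> being supported on your system.</p>

```sql
-- Sketch only: flag 2017 days whose complaint count exceeds
-- the yearly mean by more than 3 sample standard deviations.
-- The multiplier 3 is an assumed, tunable threshold.
with daily as (
    select date_received, count(complaint_id) as counts
    from DEMO_Financial.Consumer_Complaints
    where year(date_received) = 2017
    group by date_received
)
select date_received, counts
from daily
qualify counts > avg(counts) over () + 3 * stddev_samp(counts) over ()
order by counts desc;
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A relative threshold like this adapts to the overall complaint volume, whereas a fixed cutoff has to be revisited whenever the baseline changes.</p>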

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's look at which companies received the most complaints on these dates.</p>

In [None]:
select date_received, company, count(company) as counts
from DEMO_Financial.Consumer_Complaints
where date_received in (
    date '2017-01-19',
    date '2017-01-20',
    date '2017-09-08',
    date '2017-09-09',
    date '2017-09-13'
)
group by date_received, company
having counts > 500
order by date_received, counts desc;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Interestingly, we can see that the great majority of the complaints were directed at two companies: Navient Solutions and EQUIFAX. These spikes appear to coincide with the Navient lawsuit and the Equifax breach, respectively, which happened around those dates. Let's recap what happened:</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
<blockquote><i>Navient Lawsuit: In January 2017, the U.S. Consumer Financial Protection Bureau (CFPB) and the Illinois and Washington attorneys general sued Navient Solutions. Navient is a major servicer of private and federal student loans. According to the CFPB, since at least January 2010, "Navient has misallocated payments, steered struggling borrowers toward multiple forbearances instead of income-driven repayment plans, and provided unclear information about how to re-enrol in income-driven repayment plans and how to qualify for a co-signer release".

Equifax Breach: On September 7th, 2017, Equifax announced that a cybersecurity breach, one of the largest in history, had occurred from mid-May through July 2017. The personal information that was accessed included names, social security numbers, birth dates, addresses and driver's license numbers.</i></blockquote>
</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Let's now look at the top issues for Navient Solutions and Equifax during those periods to confirm our hypothesis.</p>

In [None]:
-- analyze top issues reported against Navient Solutions on 2017-01-19 and 2017-01-20
select company, product, issue, count(issue) as counts
from DEMO_Financial.Consumer_Complaints
where date_received in (
    date '2017-01-19',
    date '2017-01-20') and
    company like 'Navient Solutions%'
group by company, product, issue
order by counts desc;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
We can see the top two issues represent the majority of complaint counts against Navient Solutions. Furthermore, by looking at the product and issue columns we can infer that they are indeed related to the lawsuit regarding student loans. Now let's do the same analysis for the Equifax issues.</p>

In [None]:
-- analyze top issues reported against EQUIFAX on 2017-09-08, 2017-09-09 and 2017-09-13
select
    company,
    product,
    issue,
    count(issue) as counts
from DEMO_Financial.Consumer_Complaints
where date_received in (
    date '2017-09-08',
    date '2017-09-09',
    date '2017-09-13') and
        company like 'EQUIFAX%'
group by company, product, issue
order by counts desc;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>These results also support our hypothesis. The top issues concern improper use of the credit report, fraud alerts, identity theft and similar problems, which is consistent with the Equifax breach announced in the same time frame.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Cleanup </b></p>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Database and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
call remove_data('DEMO_Financial');
-- takes about 10 seconds, optional if you want to use the data later
--the same data is used in UseCases/VantageAnalyticLibrary and UseCases/FSCustomerJourney

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Dataset</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The Consumer Complaints Database contains complaints received by the Consumer Financial Protection Bureau (CFPB) about financial products and services, which include but are not limited to bank accounts, credit cards, credit reporting, debt collection, money transfers, mortgages, student loans and other types of consumer credit. The dataset is refreshed daily and contains information on the provider, the complaint, date, ZIP code and more. More information about the dataset can be found in the Consumer section of the <a href="http://data.gov">Data.gov</a> website.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The <b>Demo_Financial.consumer_complaints</b> dataset has 1,273,782 rows, each representing a unique consumer complaint, and 18 columns, representing the following features:</p>

- `date_received`: date that CFPB received the complaint
- `product`: type of product the consumer identified in the complaint
- `sub_product`: type of sub-product the consumer identified in the complaint
- `issue`: issue the consumer identified in the complaint
- `sub_issue`: sub-issue the consumer identified in the complaint
- `consumer_complaint_narrative`: consumer-submitted description of "what happened" from the complaint
- `company_public_response`: company's optional, public-facing response to a consumer's complaint
- `company`: complaint is about this company
- `state`: state of the mailing address provided by the consumer
- `zip_code`: mailing ZIP code provided by the consumer
- `tags`: data that supports easier searching and sorting of complaints submitted by or on behalf of consumers
- `consumer_consent_provided`: identifies whether the consumer opted in to publish their complaint narrative
- `submitted_via`: how the complaint was submitted to the CFPB
- `date_sent_to_company`: date the CFPB sent the complaint to the company
- `company_response_to_consumer`: how the company responded
- `timely_response`: whether the company gave a timely response
- `consumer_disputed`: whether the consumer disputed the company's response
- `complaint_id`: unique identification number for a complaint

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7. Explore</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>   
Through this notebook, we saw the power and simplicity of running queries in the SQL Editor and how it can be leveraged to extract insights from the data to tell the story behind a dataset. Hopefully you've noticed how easy it is to use Vantage to write your own SQL queries.<br>You can continue to explore Vantage to extract more insights and find answers to other questions by using the preloaded dataset. Here are some suggestions:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>What are the most common types of complaints? By grouping the <b>product</b> category, we can arrive at this answer. How does this change over time?</li>
    <li>How are customers submitting their complaints? The column <b>submitted_via</b> can also be grouped to answer this question.</li>
    <li>What proportion of the customer complaints are disputed? By aggregating counts of <b>consumer_disputed</b> we can answer this question.</li>
    <li>Is there seasonality in the data? What is the reason for the seasonality? If we subtract the trend from the series we can analyze the seasonality in the dataset. Are most of the complaints filed during the week or on the weekends?</li>
</ul>
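<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a starting point for the suggestions above, the following queries are minimal sketches that reuse the grouping pattern from this notebook. They only rely on columns listed in the Dataset section. The aliases <b>complaint_year</b> and <b>day_name</b> are illustrative, and the weekday query assumes that <b>TO_CHAR</b> supports the <b>'DY'</b> format element on your system.</p>

```sql
-- Sketch only: complaints per product and year, most frequent first within each year.
select year(date_received) as complaint_year, product, count(complaint_id) as counts
from DEMO_Financial.Consumer_Complaints
group by 1, 2
order by 1, 3 desc;

-- Sketch only: how complaints were submitted to the CFPB.
select submitted_via, count(complaint_id) as counts
from DEMO_Financial.Consumer_Complaints
group by submitted_via
order by counts desc;

-- Sketch only: number of complaints for each value of consumer_disputed.
-- Dividing each count by the total row count gives the proportion.
select consumer_disputed, count(complaint_id) as counts
from DEMO_Financial.Consumer_Complaints
group by consumer_disputed
order by counts desc;

-- Sketch only: complaints by day of the week (assumes TO_CHAR accepts 'DY').
select to_char(date_received, 'DY') as day_name, count(complaint_id) as counts
from DEMO_Financial.Consumer_Complaints
group by 1
order by counts desc;
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each result can be passed to <b>%chart</b> in the same way as the earlier queries.</p>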

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023. All Rights Reserved
        </div>
    </div>
</footer>