<a href="https://colab.research.google.com/github/Requenamar3/datawrangling/blob/main/Martha_Requena_Airport1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Scenario**

An analyst employed by the U.S. Transportation Security Administration (TSA) needs to produce a report on insurance claims against airports in the U.S.

#**Structure Analysis**

In [None]:
# Set the Environment
# Ignore Warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
# Write out the versions of all installed packages to requirements.txt (overwrites the file on each run)
!pip freeze > requirements.txt
# To restore the environment later: !pip install -r requirements.txt

# Remove the restriction on Jupyter that limits the columns displayed (the ... in the middle)
pd.set_option('display.max_columns', None)
# Docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html#

# Pretty display of variables. For instance, you can call df.head() and df.tail() in the same cell and BOTH display without print
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# List of ALL magic commands. To run a magic command, prefix its name with % (e.g. %env)
%lsmagic
# %env  -- list environment variables
# %%time  -- reports how long a cell took to run
# %%timeit -- runs a cell repeatedly (loop count chosen automatically) and reports the average run time (can be LONG)
# %pdb -- toggle automatic entry into the Python debugger on exceptions

# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')

# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

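# Scikit-Learn ≥0.20 is required
import sklearn
# Compare numeric (major, minor) parts; a plain string comparison would mis-order versions such as "0.9" and "0.20"
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)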

print(np.__version__)
print(sklearn.__version__)

1.25.2
1.2.2


I run a pandas data profiler (ydata_profiling) on every dataset I work with because it significantly streamlines my initial data analysis. It quickly provides a comprehensive overview, highlighting key statistics and identifying missing data. It also helps me detect outliers and visualizes data distributions, which is crucial for planning my data cleaning and preparation tasks. The profiler is particularly valuable for large datasets like this one because it automates much of what would otherwise be a very time-consuming manual process.
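
As a rough illustration of the kind of checks the profiler automates, the sketch below computes summary statistics, missing-value counts, and a duplicate count by hand. The tiny DataFrame `toy` is invented for illustration (it is not the TSA file), so the numbers mean nothing beyond the example.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only; it is NOT taken from the TSA file
toy = pd.DataFrame({
    "Claim Site": ["Checkpoint", "Checked Baggage", "Checked Baggage", "Checked Baggage"],
    "Claim Amount": [900.0, 10.0, np.nan, 10.0],
})

# Summary statistics for numeric columns (count, mean, std, quartiles, ...)
summary = toy.describe()

# Number of missing values per column
missing_per_column = toy.isna().sum()

# Number of rows that exactly repeat an earlier row
n_duplicate_rows = toy.duplicated().sum()

print(summary)
print(missing_per_column)
print(f"Duplicate rows: {n_duplicate_rows}")
```

The profiler report adds distributions, correlations, and warnings on top of these basics, which is where most of the time saving comes from.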

In [None]:
# Install the ydata_profiling package (formerly pandas_profiling) for data analysis and statistical report summaries.
!pip install ydata_profiling




In [None]:
import pandas as pd

# Import ProfileReport for creating comprehensive exploratory data analysis reports.
from ydata_profiling import ProfileReport

TSA = pd.read_csv("https://raw.githubusercontent.com/fenago/datasets/main/tsa_claims1.csv")




In [None]:
# Create a ProfileReport object from the TSA dataframe.
profile = ProfileReport(TSA, title="TSA Claims Data Analysis", explorative=True)

In [None]:
# Display interactive report
profile



The dataset has about 94K records and 13 variables of various data types. There are no missing values, and there are 19 duplicate rows. Right away I can see some out-of-range dates and some variables stored with the wrong data type.
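
One way the date issues could be checked is to parse both date columns explicitly and flag implausible values. This is a minimal sketch, not the notebook's own cleaning step: the two format strings are inferred from the sample values in this notebook, the cutoff `EARLIEST_PLAUSIBLE` is an assumed threshold, and the third incident date is invented to show an out-of-range case.

```python
import pandas as pd

# Date strings follow the formats seen in the TSA sample; the third incident
# date (year 1903) is invented to illustrate an out-of-range value
toy = pd.DataFrame({
    "Date Received": ["15-Dec-08", "5-Jun-06", "16-Oct-03"],
    "Incident Date": ["12/12/2008 0:00", "4/17/2006 0:00", "9/9/1903 0:00"],
})

# Parse each column with an explicit format; unparseable values become NaT
toy["Date Received"] = pd.to_datetime(toy["Date Received"], format="%d-%b-%y", errors="coerce")
toy["Incident Date"] = pd.to_datetime(toy["Incident Date"], format="%m/%d/%Y %H:%M", errors="coerce")

# Assumed plausibility window: not before an assumed earliest date and
# not after the day the claim was received
EARLIEST_PLAUSIBLE = pd.Timestamp("2002-01-01")  # assumption, adjust as needed
out_of_range = (toy["Incident Date"] < EARLIEST_PLAUSIBLE) | (
    toy["Incident Date"] > toy["Date Received"]
)

# Note: NaT compares as False, so rows that failed to parse need a separate isna() check
print(toy.dtypes)
print(toy[out_of_range])
```

Passing an explicit `format` avoids silent day/month confusion, and `errors="coerce"` keeps the parse from failing on the first malformed value.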

In [None]:
TSA.sample(5).T

| | 84524 | 47043 | 50511 | 67488 | 6913 |
|---|---|---|---|---|---|
| Claim Number | 2008121953554 | 2006060707490 | 2006083111898 | 2007101233343 | 1016074M |
| Date Received | 15-Dec-08 | 5-Jun-06 | 28-Aug-06 | 9-Oct-07 | 16-Oct-03 |
| Incident Date | 12/12/2008 0:00 | 4/17/2006 0:00 | 8/6/2006 0:00 | 9/27/2007 0:00 | 9/9/2003 0:00 |
| Airport Code | PHX | MIA | HNL | PHL | DFW |
| Airport Name | Phoenix Sky Harbor International | Miami International Airport | Honolulu International Airport | Philadelphia International Airport | Dallas-Fort Worth International Airport |
| Airline Name | Frontier Airlines | Air Jamaica | Aloha Airlines | USAir | American Airlines |
| Claim Type | Passenger Property Loss | Property Damage | Passenger Property Loss | Passenger Property Loss | Passenger Property Loss |
| Claim Site | Checked Baggage | Checked Baggage | Checked Baggage | Checked Baggage | Checkpoint |
| Item | Locks | Clothing - Shoes; belts; accessories; etc.; Lu... | Locks | Jewelry - Fine | Currency |
| Claim Amount | 30.95 | 200.0 | 10.0 | 899.99 | 900.0 |


In [None]:
TSA.info(verbose=True, memory_usage="deep", show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94848 entries, 0 to 94847
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Claim Number   94848 non-null  object 
 1   Date Received  94848 non-null  object 
 2   Incident Date  94848 non-null  object 
 3   Airport Code   94848 non-null  object 
 4   Airport Name   94848 non-null  object 
 5   Airline Name   94848 non-null  object 
 6   Claim Type     94848 non-null  object 
 7   Claim Site     94848 non-null  object 
 8   Item           94848 non-null  object 
 9   Claim Amount   94848 non-null  float64
 10  Status         94848 non-null  object 
 11  Close Amount   94848 non-null  float64
 12  Disposition    94848 non-null  object 
dtypes: float64(2), object(11)
memory usage: 71.7 MB


##**Quality Analysis**

In [None]:
# Count duplicate rows while ignoring the identifier column ('Claim Number')
n_duplicates = TSA.drop(labels=['Claim Number'], axis=1).duplicated().sum()

print(f"You seem to have {n_duplicates} duplicates in your database.")

You seem to have 19 duplicates in your database.


In [None]:
# Find the duplicate 'Claim Number' entries
duplicate_claim_numbers = TSA[TSA.duplicated('Claim Number', keep=False)]

# List the duplicate 'Claim Number' entries
list_of_duplicate_claim_numbers = duplicate_claim_numbers['Claim Number'].unique().tolist()


In [None]:
# Print the list of duplicate 'Claim Number' entries
print(list_of_duplicate_claim_numbers)


[]
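
The two checks above measure different things: the first counts rows that repeat in every column except `Claim Number`, while the second looks for repeated `Claim Number` values and finds none. A minimal sketch of how the first kind of duplicate could be inspected and dropped follows; the rows in `toy` are invented and are not TSA data.

```python
import pandas as pd

# Invented toy rows: claim numbers are unique, but rows 0 and 2 match otherwise
toy = pd.DataFrame({
    "Claim Number": ["A1", "A2", "A3"],
    "Airport Code": ["PHX", "MIA", "PHX"],
    "Claim Amount": [30.95, 200.0, 30.95],
})

# Columns to compare: everything except the identifier
compare_cols = [col for col in toy.columns if col != "Claim Number"]

# keep=False marks every member of a duplicate group, which helps when reviewing them
duplicate_rows = toy[toy.duplicated(subset=compare_cols, keep=False)]

# Keep the first row of each group and drop the rest
deduplicated = toy.drop_duplicates(subset=compare_cols, keep="first")

print(duplicate_rows)
print(deduplicated)
```

Whether such rows are true duplicates or separate claims that happen to match is a judgment call, so reviewing `duplicate_rows` before dropping anything is the safer order.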


**1-What is the most common type of insurance claim?**



**2-Which claim site within the airport are claims most commonly filed for?**



**3-What type of claim is made most at each claim site?**

**4-What is the typical claim amount?**

**5-What is the overall claim approval rate for the entire U.S.?**
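
Questions 1 to 4 map onto a few standard pandas operations. The sketch below shows one possible approach on five rows copied from the `TSA.sample(5)` output above, so the printed answers describe those five rows only, not the full dataset. Using the median as the "typical" amount is an assumption. Question 5 is left out because the values of `Status` and `Disposition` are not shown in that sample.

```python
import pandas as pd

# Five rows copied from the sample output; far too few to draw real conclusions
toy = pd.DataFrame({
    "Claim Type": ["Passenger Property Loss", "Property Damage", "Passenger Property Loss",
                   "Passenger Property Loss", "Passenger Property Loss"],
    "Claim Site": ["Checked Baggage", "Checked Baggage", "Checked Baggage",
                   "Checked Baggage", "Checkpoint"],
    "Claim Amount": [30.95, 200.0, 10.0, 899.99, 900.0],
})

# 1. Most common claim type
most_common_type = toy["Claim Type"].value_counts().idxmax()

# 2. Claim site with the most claims
most_common_site = toy["Claim Site"].value_counts().idxmax()

# 3. Most frequent claim type within each claim site
top_type_per_site = toy.groupby("Claim Site")["Claim Type"].agg(
    lambda s: s.value_counts().idxmax()
)

# 4. Typical claim amount, taken here as the median (robust to very large claims)
typical_amount = toy["Claim Amount"].median()

print(most_common_type)
print(most_common_site)
print(top_type_per_site)
print(typical_amount)
```

On the real data the same calls would run against `TSA` instead of `toy`.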

In [None]:
# Get a dictionary of column names and data types
data_dictionary = TSA.dtypes.apply(lambda x: x.name).to_dict()

# Print the dictionary
print(data_dictionary)

{'Claim Number': 'object', 'Date Received': 'object', 'Incident Date': 'object', 'Airport Code': 'object', 'Airport Name': 'object', 'Airline Name': 'object', 'Claim Type': 'object', 'Claim Site': 'object', 'Item': 'object', 'Claim Amount': 'float64', 'Status': 'object', 'Close Amount': 'float64', 'Disposition': 'object'}


The code `TSA.dtypes.apply(lambda x: x.name).to_dict()` builds a dictionary that maps each column name in the DataFrame `TSA` to its data type. `TSA.dtypes` returns the data types as a Series, `.apply(lambda x: x.name)` turns each data type into its string name, and `.to_dict()` converts the Series into a dictionary with column names as keys and data type names as values.
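
A slightly shorter variant casts the dtype Series to strings with `astype(str)`. This is a sketch on an invented two-column frame; for the plain NumPy dtypes seen in this dataset it should give the same mapping as the `apply` version, though that equivalence is an assumption worth checking for other dtypes.

```python
import pandas as pd

# Invented two-column frame that mirrors two of the TSA column names
toy = pd.DataFrame({"Item": ["Locks", "Currency"], "Claim Amount": [30.95, 900.0]})

# Version used in the notebook: take the .name of each dtype
via_apply = toy.dtypes.apply(lambda x: x.name).to_dict()

# Alternative: cast the dtype Series to strings directly
via_astype = toy.dtypes.astype(str).to_dict()

print(via_apply)
print(via_astype)
```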