# Phase 1 Project

For this project, I analyze two datasets, "AviationData.csv" and "USState_Codes.csv", to help a company decide which aircraft to purchase as it expands into the aviation industry. The analysis focuses on cleaning the data and extracting actionable insights so that the head of the new aviation division can select the aircraft that pose the lowest risk for commercial and private operations.


### Aviation Dataset:
The "AviationData.csv" dataset is an extensive collection of historical aircraft accidents and incidents, documenting key information about each event. This data spans both commercial and private aviation, covering a wide range of locations, primarily within the United States but also internationally. 
 
### US State Codes Dataset:
The "USState_Codes.csv" dataset provides a mapping of state codes and names, which is useful for geospatial analysis when identifying accident locations across different U.S. states. 
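As a minimal sketch of how this mapping could be used: the Location values in the aviation data end with a two-letter state abbreviation (for example "MOOSE CREEK, ID"), so the abbreviation can be split off and joined to the state table. The column names `State` and `Abbreviation` are assumptions about "USState_Codes.csv", and both small frames are stand-ins for the real files.

```python
import pandas as pd

# Stand-in rows using the "CITY, ST" Location format seen in the aviation data
accidents = pd.DataFrame({
    "Location": ["MOOSE CREEK, ID", "BRIDGEPORT, CA", "Saltville, VA",
                 "EUREKA, CA", "Canton, OH"]
})

# Hypothetical layout for USState_Codes.csv (column names are assumed)
states = pd.DataFrame({
    "State": ["Idaho", "California", "Virginia", "Ohio"],
    "Abbreviation": ["ID", "CA", "VA", "OH"],
})

# Treat the text after the last comma as the state abbreviation
accidents["State.Code"] = (
    accidents["Location"].str.rsplit(",", n=1).str[-1].str.strip().str.upper()
)

# A left merge keeps accidents whose code has no match (e.g. non-U.S. locations)
merged = accidents.merge(
    states, how="left", left_on="State.Code", right_on="Abbreviation"
)

# Number of accidents per state
counts = merged["State"].value_counts()
print(counts)
```

Splitting on the last comma rather than the first keeps place names that themselves contain commas intact.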



### Key Goals:
By combining insights from both datasets, the objective is to:
1. Clean and prepare the data to ensure accuracy and consistency.
2. Analyze trends in accident causes, phases of flight, and environmental conditions.
3. Identify aircraft models involved in incidents with high fatality and injury rates.
4. Assess geographic trends by correlating accident locations with state codes.
5. Translate these findings into actionable insights that will help the head of the new aviation division make informed decisions on which aircraft models to prioritize for safe and efficient operations.

This analysis will ultimately provide a risk-based assessment of different aircraft, supporting safer and smarter business decisions.
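One way goal 3 might be approached is to aggregate the injury counts by Make and Model and compare fatalities with the number of people on board. The sketch below uses invented placeholder rows (the makes, models, and counts are hypothetical, not values from the dataset), and the `Fatality.Rate` definition is an assumption, one of several reasonable choices.

```python
import pandas as pd

# Hypothetical placeholder rows; only the column names come from the dataset
sample = pd.DataFrame({
    "Make": ["MakeA", "MakeA", "MakeB", "MakeB"],
    "Model": ["X1", "X1", "Y2", "Y2"],
    "Total.Fatal.Injuries": [2.0, 0.0, 0.0, 1.0],
    "Total.Serious.Injuries": [0.0, 1.0, 0.0, None],
    "Total.Minor.Injuries": [0.0, 0.0, 1.0, 0.0],
    "Total.Uninjured": [0.0, 1.0, 3.0, 5.0],
})

injury_cols = [
    "Total.Fatal.Injuries",
    "Total.Serious.Injuries",
    "Total.Minor.Injuries",
    "Total.Uninjured",
]

grouped = sample.groupby(["Make", "Model"])

# Sum injuries per model; missing values are skipped by default
by_model = grouped[injury_cols].sum()
by_model["Events"] = grouped.size()

# People on board = everyone counted in any of the four injury columns
by_model["People"] = by_model[injury_cols].sum(axis=1)

# Assumed risk measure: share of people on board who were fatally injured
by_model["Fatality.Rate"] = by_model["Total.Fatal.Injuries"] / by_model["People"]

# Lowest assumed risk first
ranked = by_model.sort_values("Fatality.Rate")
print(ranked[["Events", "People", "Fatality.Rate"]])
```

On real data, models with very few events or with zero people recorded would need separate handling before a ranking like this is meaningful.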


### Getting Started

1. Import pandas and set the standard alias
2. Import matplotlib.pyplot and set the standard alias

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
#%matplotlib inline 

### Load the data

The data for this activity is stored in a file called 'AviationData.csv'.

In [9]:
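# Load the dataset (Latin-1 encoding; low_memory=False reads the file in one pass
# so that column dtypes are inferred consistently)
df = pd.read_csv('AviationData.csv', encoding='latin-1', low_memory=False)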

Now, display the head of the DataFrame to ensure everything loaded correctly.

In [10]:
# Display the first 5 records
df.head()

| | Event.Id | Investigation.Type | Accident.Number | Event.Date | Location | Country | Latitude | Longitude | Airport.Code | Airport.Name | ... | Purpose.of.flight | Air.carrier | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Total.Uninjured | Weather.Condition | Broad.phase.of.flight | Report.Status | Publication.Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20001218X45444 | Accident | SEA87LA080 | 1948-10-24 | MOOSE CREEK, ID | United States | NaN | NaN | NaN | NaN | ... | Personal | NaN | 2.0 | 0.0 | 0.0 | 0.0 | UNK | Cruise | Probable Cause | NaN |
| 1 | 20001218X45447 | Accident | LAX94LA336 | 1962-07-19 | BRIDGEPORT, CA | United States | NaN | NaN | NaN | NaN | ... | Personal | NaN | 4.0 | 0.0 | 0.0 | 0.0 | UNK | Unknown | Probable Cause | 19-09-1996 |
| 2 | 20061025X01555 | Accident | NYC07LA005 | 1974-08-30 | Saltville, VA | United States | 36.922223 | -81.878056 | NaN | NaN | ... | Personal | NaN | 3.0 | NaN | NaN | NaN | IMC | Cruise | Probable Cause | 26-02-2007 |
| 3 | 20001218X45448 | Accident | LAX96LA321 | 1977-06-19 | EUREKA, CA | United States | NaN | NaN | NaN | NaN | ... | Personal | NaN | 2.0 | 0.0 | 0.0 | 0.0 | IMC | Cruise | Probable Cause | 12-09-2000 |
| 4 | 20041105X01764 | Accident | CHI79FA064 | 1979-08-02 | Canton, OH | United States | NaN | NaN | NaN | NaN | ... | Personal | NaN | 1.0 | 2.0 | NaN | 0.0 | VMC | Approach | Probable Cause | 16-04-1980 |


I begin by exploring the size and structure of the aviation dataset with df.shape and df.columns, which give the total number of records and the available variables. Identifying key columns such as aircraft details, accident locations, and injury counts clarifies what the data can support.
Next, I apply df.describe() to generate summary statistics for the numeric variables, including the numbers of fatalities, serious injuries, and uninjured individuals. Finally, df.isnull().sum() counts the missing values in each column.
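A small sketch of the summary-statistics step, restricted to the injury columns. The frame below holds only the five records and four injury columns visible in the preview above; on the full dataset the same calls would run on df. The names `preview`, `summary`, and `missing_share` are illustrative only.

```python
import numpy as np
import pandas as pd

# Injury counts from the five previewed records (NaN marks a missing value)
preview = pd.DataFrame({
    "Total.Fatal.Injuries": [2.0, 4.0, 3.0, 2.0, 1.0],
    "Total.Serious.Injuries": [0.0, 0.0, np.nan, 0.0, 2.0],
    "Total.Minor.Injuries": [0.0, 0.0, np.nan, 0.0, np.nan],
    "Total.Uninjured": [0.0, 0.0, np.nan, 0.0, 0.0],
})

# describe() must be called with parentheses; its "count" row excludes NaN
summary = preview.describe()

# Fraction of missing values per column
missing_share = preview.isnull().mean()

print(summary)
print(missing_share)
```

Reading the count row alongside the missing share shows how many records each statistic is actually based on.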


In [13]:
# Return the dimensions of the DataFrame
df.shape


(88889, 31)

In [15]:
#Return the columns in the DataFrame
df.columns


Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',
       'Location', 'Country', 'Latitude', 'Longitude', 'Airport.Code',
       'Airport.Name', 'Injury.Severity', 'Aircraft.damage',
       'Aircraft.Category', 'Registration.Number', 'Make', 'Model',
       'Amateur.Built', 'Number.of.Engines', 'Engine.Type', 'FAR.Description',
       'Schedule', 'Purpose.of.flight', 'Air.carrier', 'Total.Fatal.Injuries',
       'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured',
       'Weather.Condition', 'Broad.phase.of.flight', 'Report.Status',
       'Publication.Date'],
      dtype='object')

In [16]:
# Summary statistics for the numeric columns in the DataFrame
df.describe()

In [17]:
# Count the missing values in each column
df.isnull().sum()


Event.Id                      0
Investigation.Type            0
Accident.Number               0
Event.Date                    0
Location                     52
Country                     226
Latitude                  54507
Longitude                 54516
Airport.Code              38757
Airport.Name              36185
Injury.Severity            1000
Aircraft.damage            3194
Aircraft.Category         56602
Registration.Number        1382
Make                         63
Model                        92
Amateur.Built               102
Number.of.Engines          6084
Engine.Type                7096
FAR.Description           56866
Schedule                  76307
Purpose.of.flight          6192
Air.carrier               72241
Total.Fatal.Injuries      11401
Total.Serious.Injuries    12510
Total.Minor.Injuries      11933
Total.Uninjured            5912
Weather.Condition          4492
Broad.phase.of.flight     27165
Report.Status              6384
Publication.Date          13771
dtype: i