# Project: Investigate a Dataset (FBI open up!)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
# Introduction

> ## Information about the data set

> The data comes from the FBI's National Instant Criminal Background Check System (NICS). The NICS is used to determine whether a prospective buyer is eligible to buy firearms or explosives. Gun shops call into this system to ensure that each customer does not have a criminal record or isn't otherwise ineligible to make a purchase. The data has been supplemented with state-level data from census.gov.

> * The NICS data is found in one sheet of an .xlsx file. It contains the number of firearm checks by month, state, and type.
> * The U.S. census data is found in a .csv file. It contains several variables at the state level. Most variables have just one data point per state (2016), but a few have data for more than one year.

> ## Questions to ask
* Which census data is most associated with a high number of guns per capita?
* Which states have had the highest growth in gun registrations?
* What is the overall trend of gun purchases?

> ## More questions to ask
* Which state bought the most guns in total?
* Which state spent the most money on guns in total?
* Is there a relationship between the month and the number of guns bought?

In [81]:
# Use this cell to set up import statements for all of the packages that you
#   plan to use.
# Import the needed libraries
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

> In this section of the report, we will load in the data, check for cleanliness, and then trim and clean our dataset for analysis.

## steps taken
* 
* 
* 

### General Properties

In [82]:
# Set the path to your data directory here
import os
#path = 'E:/career/Dataa/udacity/Advanced Data Analysis Nanodegree Program/2. Introduction to Data Analysis/03 Investigate A Dataset Project/data'
#os.chdir(path)
print('path is:', os.getcwd())
print('the data there:' , os.listdir())

path is: E:\career\Dataa\udacity\Advanced Data Analysis Nanodegree Program\2. Introduction to Data Analysis\03 Investigate A Dataset Project\data
the data there: ['Data.zip', 'gun_data.xlsx', 'New folder', 'U.S. Census Data.csv']


In [83]:
# Check that the data files are present before starting
if os.path.exists('gun_data.xlsx') and os.path.exists('U.S. Census Data.csv'):
    print('you are ok to start')
else:
    # Otherwise download the archive and unpack it into the current directory
    from zipfile import ZipFile
    import urllib.request
    datalink = "https://d17h27t6h515a5.cloudfront.net/topher/2017/November/5a0a5623_ncis-and-census-data/ncis-and-census-data.zip"
    urllib.request.urlretrieve(datalink, 'Data.zip')
    # Open Data.zip as a ZipFile object
    with ZipFile('Data.zip', 'r') as zipObj:
        # Extract all the contents of the zip file into the current directory
        zipObj.extractall()
    print('the data is downloaded and unzipped, you are good to go')

you are ok to start


In [84]:
# Load the data into the workspace
df_census = pd.read_csv('U.S. Census Data.csv')
df_gun = pd.read_excel('gun_data.xlsx')

In [85]:
#shape of the data set

print('df_gun', df_gun.shape)

df_gun (12485, 27)


# General Properties: Census
* Number of rows and columns
* Data stored in each column
* Missing values
* Duplicated values

In [86]:
print('df_census shape is:' , df_census.shape)

df_census shape is: (85, 52)


In [87]:
# Each data set has a lot of columns on its own
# Let's list them
for i, v in enumerate(df_census.columns):
    print(i, v)

0 Fact
1 Fact Note
2 Alabama
3 Alaska
4 Arizona
5 Arkansas
6 California
7 Colorado
8 Connecticut
9 Delaware
10 Florida
11 Georgia
12 Hawaii
13 Idaho
14 Illinois
15 Indiana
16 Iowa
17 Kansas
18 Kentucky
19 Louisiana
20 Maine
21 Maryland
22 Massachusetts
23 Michigan
24 Minnesota
25 Mississippi
26 Missouri
27 Montana
28 Nebraska
29 Nevada
30 New Hampshire
31 New Jersey
32 New Mexico
33 New York
34 North Carolina
35 North Dakota
36 Ohio
37 Oklahoma
38 Oregon
39 Pennsylvania
40 Rhode Island
41 South Carolina
42 South Dakota
43 Tennessee
44 Texas
45 Utah
46 Vermont
47 Virginia
48 Washington
49 West Virginia
50 Wisconsin
51 Wyoming


In [88]:
# Looks like the columns are the fact, the fact note, and the 50 U.S. states
# So what are the facts?
df_census['Fact'].head(5)
# It looks like the facts are the questions that were asked, with the answers compared across the states

0         Population estimates, July 1, 2016,  (V2016)
1    Population estimates base, April 1, 2010,  (V2...
2    Population, percent change - April 1, 2010 (es...
3                    Population, Census, April 1, 2010
4    Persons under 5 years, percent, July 1, 2016, ...
Name: Fact, dtype: object

In [95]:
#look at the other data in the DF 
df_census.head()

Unnamed: 0,Fact,Fact Note,Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,...,South Dakota,Tennessee,Texas,Utah,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming
0,"Population estimates, July 1, 2016, (V2016)",,4863300,741894,6931071,2988248,39250017,5540545,3576452,952065,...,865454.0,6651194.0,27862596,3051217,624594,8411808,7288000,1831102,5778708,585501
1,"Population estimates base, April 1, 2010, (V2...",,4780131,710249,6392301,2916025,37254522,5029324,3574114,897936,...,814195.0,6346298.0,25146100,2763888,625741,8001041,6724545,1853011,5687289,563767
2,"Population, percent change - April 1, 2010 (es...",,1.70%,4.50%,8.40%,2.50%,5.40%,10.20%,0.10%,6.00%,...,0.063,0.048,10.80%,10.40%,-0.20%,5.10%,8.40%,-1.20%,1.60%,3.90%
3,"Population, Census, April 1, 2010",,4779736,710231,6392017,2915918,37253956,5029196,3574097,897934,...,814180.0,6346105.0,25145561,2763885,625741,8001024,6724540,1852994,5686986,563626
4,"Persons under 5 years, percent, July 1, 2016, ...",,6.00%,7.30%,6.30%,6.40%,6.30%,6.10%,5.20%,5.80%,...,0.071,0.061,7.20%,8.30%,4.90%,6.10%,6.20%,5.50%,5.80%,6.50%
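The table above stores one fact per row and one state per column, which is awkward for any per-state analysis. A minimal sketch of one way to flip it, assuming a tiny stand-in frame (`toy_census` and `by_state` are hypothetical names, not part of the notebook) built from a few of the values shown above:

```python
import pandas as pd

# Tiny stand-in for df_census: facts as rows, states as columns.
# Values are copied from the head() output above; the frame itself is illustrative.
toy_census = pd.DataFrame({
    "Fact": ["Population estimates, July 1, 2016,  (V2016)",
             "Population, Census, April 1, 2010"],
    "Fact Note": [None, None],
    "Alabama": ["4863300", "4779736"],
    "Alaska": ["741894", "710231"],
})

# Drop the note column, index by fact, then transpose so each state is a row.
by_state = (
    toy_census.drop(columns="Fact Note")
    .set_index("Fact")
    .T
)
by_state.index.name = "state"

print(by_state)
```

With states as rows, each fact becomes an ordinary column that could later be joined to per-state gun totals.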


In [90]:
# These are numbers related to the question being asked (the fact), and the fact note appears to have a lot of missing values
# Let's see their types
df_census.dtypes

Fact              object
Fact Note         object
Alabama           object
Alaska            object
Arizona           object
Arkansas          object
California        object
Colorado          object
Connecticut       object
Delaware          object
Florida           object
Georgia           object
Hawaii            object
Idaho             object
Illinois          object
Indiana           object
Iowa              object
Kansas            object
Kentucky          object
Louisiana         object
Maine             object
Maryland          object
Massachusetts     object
Michigan          object
Minnesota         object
Mississippi       object
Missouri          object
Montana           object
Nebraska          object
Nevada            object
New Hampshire     object
New Jersey        object
New Mexico        object
New York          object
North Carolina    object
North Dakota      object
Ohio              object
Oklahoma          object
Oregon            object
Pennsylvania      object


In [102]:
# That's not OK: the numbers are treated as strings, so we will need to clean this up
# Now let's see the missing values for each column
df_census.isnull().sum()

Fact               5
Fact Note         57
Alabama           20
Alaska            20
Arizona           20
Arkansas          20
California        20
Colorado          20
Connecticut       20
Delaware          20
Florida           20
Georgia           20
Hawaii            20
Idaho             20
Illinois          20
Indiana           20
Iowa              20
Kansas            20
Kentucky          20
Louisiana         20
Maine             20
Maryland          20
Massachusetts     20
Michigan          20
Minnesota         20
Mississippi       20
Missouri          20
Montana           20
Nebraska          20
Nevada            20
New Hampshire     20
New Jersey        20
New Mexico        20
New York          20
North Carolina    20
North Dakota      20
Ohio              20
Oklahoma          20
Oregon            20
Pennsylvania      20
Rhode Island      20
South Carolina    20
South Dakota      20
Tennessee         20
Texas             20
Utah              20
Vermont           20
Virginia     

In [129]:
# The fixed number 20 is a bit weird to me, so I wanted to see if the missing values are all in the same positions
# I picked 2 random states to check; let's see
(df_census.California.isnull() == df_census.Wyoming.isnull()).all()


True

In [153]:
# OK, let's check whether the NaN positions are the same for all of the states
for c in range(2, 51):
    # Compare each state column with the next one; any mismatch means the positions differ
    if (df_census.iloc[:, c].isnull() != df_census.iloc[:, c + 1].isnull()).any():
        print('i was wrong')
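The same check can be done without a loop by comparing every state's null mask with the first state's mask in one step. This is only a sketch on a small made-up frame; `same_null_pattern` and `toy` are hypothetical names introduced for illustration:

```python
import numpy as np
import pandas as pd

def same_null_pattern(frame):
    """Return True if every column of `frame` is null in exactly the same rows."""
    mask = frame.isnull()
    # Compare each column's mask with the first column's mask, row by row.
    return bool(mask.eq(mask.iloc[:, 0], axis=0).all().all())

# Made-up stand-in: the last row is missing for every state.
toy = pd.DataFrame({
    "Fact": ["a", "b", None],
    "Fact Note": [None, None, "note"],
    "Alabama": ["1", "2", np.nan],
    "Alaska": ["3", "4", np.nan],
    "Wyoming": ["5", "6", np.nan],
})

print(same_null_pattern(toy.iloc[:, 2:]))
```

On the real data the call would presumably be `same_null_pattern(df_census.iloc[:, 2:])`, skipping the `Fact` and `Fact Note` columns as the loop does.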

In [161]:
# Now let's see what the Fact and Fact Note columns hold in those rows
df_census.iloc[:,0:2][df_census.California.isnull()]

Unnamed: 0,Fact,Fact Note
65,,
66,NOTE: FIPS Code values are enclosed in quotes ...,
67,,
68,Value Notes,
69,1,Includes data not distributed by county.
70,,
71,Fact Notes,
72,(a),Includes persons reporting only one race
73,(b),"Hispanics may be of any race, so also are incl..."
74,(c),Economic Census - Puerto Rico data are not com...


Unnamed: 0,Fact,Fact Note,Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,...,South Dakota,Tennessee,Texas,Utah,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming
65,,,,,,,,,,,...,,,,,,,,,,
66,NOTE: FIPS Code values are enclosed in quotes ...,,,,,,,,,,...,,,,,,,,,,
67,,,,,,,,,,,...,,,,,,,,,,
68,Value Notes,,,,,,,,,,...,,,,,,,,,,
69,1,Includes data not distributed by county.,,,,,,,,,...,,,,,,,,,,
70,,,,,,,,,,,...,,,,,,,,,,
71,Fact Notes,,,,,,,,,,...,,,,,,,,,,
72,(a),Includes persons reporting only one race,,,,,,,,,...,,,,,,,,,,
73,(b),"Hispanics may be of any race, so also are incl...",,,,,,,,,...,,,,,,,,,,
74,(c),Economic Census - Puerto Rico data are not com...,,,,,,,,,...,,,,,,,,,,
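The rows shown above are footnotes rather than data: every state column is empty in them. One possible way to trim them is to drop rows where all state columns are null. The sketch below uses a made-up four-row frame (`toy` and `trimmed` are hypothetical names) instead of the real file:

```python
import numpy as np
import pandas as pd

# Made-up stand-in: one data row followed by blank and footnote rows.
toy = pd.DataFrame({
    "Fact": ["Population estimates, July 1, 2016,  (V2016)", np.nan, "Value Notes", "(a)"],
    "Fact Note": [np.nan, np.nan, np.nan, "Includes persons reporting only one race"],
    "Alabama": ["4863300", np.nan, np.nan, np.nan],
    "Alaska": ["741894", np.nan, np.nan, np.nan],
})

# State columns start after 'Fact' and 'Fact Note'.
state_cols = list(toy.columns[2:])

# Keep only rows that have a value in at least one state column.
trimmed = toy.dropna(subset=state_cols, how="all").reset_index(drop=True)

print(trimmed)
```

Using `how="all"` keeps a fact row even if a single state happens to be missing, which is safer than dropping on any null.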


In [101]:
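# The number 20 is unusual, so let's investigate more (FBI style XD)
# Note: indexing with a boolean frame keeps only the cells that are already null, so every cell shows as NaN
df_census[df_census.isnull()].head(57)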

Unnamed: 0,Fact,Fact Note,Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,...,South Dakota,Tennessee,Texas,Utah,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,


Fact               object
Fact Note          object
Alabama           float64
Alaska            float64
Arizona           float64
Arkansas          float64
California        float64
Colorado          float64
Connecticut       float64
Delaware          float64
Florida           float64
Georgia           float64
Hawaii            float64
Idaho             float64
Illinois          float64
Indiana           float64
Iowa              float64
Kansas            float64
Kentucky          float64
Louisiana         float64
Maine             float64
Maryland          float64
Massachusetts     float64
Michigan          float64
Minnesota         float64
Mississippi       float64
Missouri          float64
Montana           float64
Nebraska          float64
Nevada            float64
New Hampshire     float64
New Jersey        float64
New Mexico        float64
New York          float64
North Carolina    float64
North Dakota      float64
Ohio              float64
Oklahoma          float64
Oregon      

In [51]:
for i, v in enumerate(df_gun.columns):
    print(i, v)

0 month
1 state
2 permit
3 permit_recheck
4 handgun
5 long_gun
6 other
7 multiple
8 admin
9 prepawn_handgun
10 prepawn_long_gun
11 prepawn_other
12 redemption_handgun
13 redemption_long_gun
14 redemption_other
15 returned_handgun
16 returned_long_gun
17 returned_other
18 rentals_handgun
19 rentals_long_gun
20 private_sale_handgun
21 private_sale_long_gun
22 private_sale_other
23 return_to_seller_handgun
24 return_to_seller_long_gun
25 return_to_seller_other
26 totals


In [74]:
# This is a bit tricky, so let's look at more of this DataFrame
df_gun.dtypes
# Now let's try grouping them so we can see more

month                         object
state                         object
permit                       float64
permit_recheck               float64
handgun                      float64
long_gun                     float64
other                        float64
multiple                       int64
admin                        float64
prepawn_handgun              float64
prepawn_long_gun             float64
prepawn_other                float64
redemption_handgun           float64
redemption_long_gun          float64
redemption_other             float64
returned_handgun             float64
returned_long_gun            float64
returned_other               float64
rentals_handgun              float64
rentals_long_gun             float64
private_sale_handgun         float64
private_sale_long_gun        float64
private_sale_other           float64
return_to_seller_handgun     float64
return_to_seller_long_gun    float64
return_to_seller_other       float64
totals                         int64
d
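Since `month` is stored as an object column, grouping by time needs a datetime conversion first. The sketch below shows one way to get yearly totals for the trend question. It assumes the month strings look like `YYYY-MM`, which the outputs above do not confirm, and it uses made-up numbers in a hypothetical `toy_gun` frame:

```python
import pandas as pd

# Made-up stand-in for df_gun; the 'YYYY-MM' month format is an assumption.
toy_gun = pd.DataFrame({
    "month": ["2017-09", "2017-08", "2016-12", "2017-09"],
    "state": ["Alabama", "Alabama", "Alabama", "Alaska"],
    "totals": [10, 20, 30, 5],
})

# Parse the month strings so that date parts become available.
toy_gun["month"] = pd.to_datetime(toy_gun["month"], format="%Y-%m")

# Sum the checks per calendar year across all states.
yearly = toy_gun.groupby(toy_gun["month"].dt.year)["totals"].sum()

print(yearly)
```

Grouping on `toy_gun["month"].dt.month` instead would address the question about a relationship between the month and the number of checks.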

In [None]:
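# The state columns should not be of object dtype,
# so let's convert them into floats
# We start after Fact and Fact Note
col = df_census.columns[2:]
for c in col:
    # Strip thousands separators, '$' and '%' before parsing; the pattern '(\d+)'
    # would keep only the first run of digits (e.g. '4,863,300' -> 4)
    cleaned = df_census[c].astype(str).str.replace(r'[,$%]', '', regex=True)
    # Anything that still is not a number becomes NaN
    # Note: '6.30%' becomes 6.3, while some columns already hold fractions such as 0.063
    df_census[c] = pd.to_numeric(cleaned, errors='coerce')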
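The `head()` output earlier shows the same fact written as `1.70%` in some state columns and as `0.063` in others, so the conversion also has to put percentages on one scale. A small pure-Python helper is one way to sketch this; `parse_census_value` is a hypothetical name, and treating every `%` value as a fraction is an assumption about what the analysis needs:

```python
def parse_census_value(text):
    """Return a float for strings like '4,863,300', '$1,234' or '1.70%', else None."""
    if text is None:
        return None
    # Remove thousands separators and currency signs.
    cleaned = str(text).strip().replace(",", "").replace("$", "")
    is_percent = cleaned.endswith("%")
    if is_percent:
        cleaned = cleaned[:-1]
    try:
        number = float(cleaned)
    except ValueError:
        return None  # non-numeric entries stay missing
    # Put percentages on the same 0-1 scale as values such as 0.063.
    return number / 100 if is_percent else number

print(parse_census_value("4,863,300"))
print(parse_census_value("1.70%"))
print(parse_census_value("0.063"))
```

It could be applied per column with something like `df_census[c].map(parse_census_value)`, after the footnote rows have been dropped.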

<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

### Research Question 1 (Replace this header name!)

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed. Make sure that you are clear with regards to the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work, you should save a copy of the report in HTML or PDF form via the **File** > **Download as** submenu. Before exporting your report, check over it to make sure that the flow of the report is complete. You should probably remove all of the "Tip" quotes like this one so that the presentation is as tidy as possible. Congratulations!