In [7]:
%%HTML
<h1>Intro to data structures</h1>
<br/>
We’ll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get you started. The fundamental behavior about data types, indexing, and axis labeling / alignment applies across all of the objects.
To get started, import numpy and load pandas into your namespace:

In [9]:
import numpy as np
import pandas as pd

In [10]:
%%HTML
<h2>Series</h2>
<br/><br/>
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call <code>pd.Series(data, index=index)</code>.<br/>
A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

    

In [11]:
s = pd.Series([1,2,3,4,5], ['a','b','c','d','e'])
print(s)



a    1
b    2
c    3
d    4
e    5
dtype: int64


In [17]:
# Series data must be one-dimensional: a 2-D array such as np.zeros((1, 5)) is rejected
print(pd.Series(np.zeros(5)))


0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
dtype: float64

In [23]:
myDict = {'banana': 5, 'apple': 3, 'orange': 7}

# No index provided: the dict keys become the index
fruits = pd.Series(myDict)
# Index provided: values are looked up by label; labels missing from the dict become NaN
shoppingCart = pd.Series(myDict,
            index=['banana', 'banana', 'orange', 'apple', 'pineapple'])

print(fruits)
print("\n")
print(shoppingCart)



apple     3
banana    5
orange    7
dtype: int64


banana       5.0
banana       5.0
orange       7.0
apple        3.0
pineapple    NaN
dtype: float64
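
The second output above switches from int64 to float64 because a missing label is stored as NaN, which is a floating point value. A minimal sketch of that effect, using a made-up `prices` dict that is not part of the notebook:

```python
import pandas as pd

# Hypothetical data, for illustration only
prices = {'banana': 5, 'apple': 3, 'orange': 7}

# Every requested label exists in the dict: values stay integers
known = pd.Series(prices, index=['banana', 'apple'])

# 'pineapple' is not a key of the dict: its value becomes NaN
# and the whole Series is upcast to float64
partial = pd.Series(prices, index=['banana', 'pineapple'])

print(known.dtype)
print(partial.dtype)
print(partial.isna())
```

`isna()` returns a boolean Series, which makes it easy to find the labels that had no match.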


In [20]:
print(pd.Series(5, index=['a','b','c']))

a    5
b    5
c    5
dtype: int64


In [25]:
print(shoppingCart)
print("\n")
print(shoppingCart[:2])
print("\n")
print(shoppingCart[shoppingCart>shoppingCart.median()])
print("\n")
print(shoppingCart.iloc[[1, 3]])
print("\n")
print(np.exp(shoppingCart))



banana       5.0
banana       5.0
orange       7.0
apple        3.0
pineapple    NaN
dtype: float64


banana    5.0
banana    5.0
dtype: float64


orange    7.0
dtype: float64


banana    5.0
apple     3.0
dtype: float64


banana        148.413159
banana        148.413159
orange       1096.633158
apple          20.085537
pineapple            NaN
dtype: float64
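
The cell above mixes selection by position, by boolean mask, and element-wise NumPy functions. As a hedged sketch, the explicit `.iloc` (position) and `.loc` (label) accessors make the intent unambiguous; the `cart` Series below is a made-up example with unique labels:

```python
import pandas as pd

# Hypothetical data with unique labels
cart = pd.Series([5.0, 7.0, 3.0], index=['banana', 'orange', 'apple'])

# Select by integer position
by_position = cart.iloc[[0, 2]]

# Select the same rows by label
by_label = cart.loc[['banana', 'apple']]

# Boolean mask: keep only values above the median
above_median = cart[cart > cart.median()]

print(by_position)
print(by_label)
print(above_median)
```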


In [34]:
shoppingCart['banana'] = 5
print(shoppingCart['banana'])
print("-----")
shoppingCart['banana'] = 111
print("-----")
print(shoppingCart['banana'])
print("-----")
print('banana' in shoppingCart)
print("-----")
try:
    print(shoppingCart['nonexistentKey'])
except KeyError:
    print("error")


banana    5.0
banana    5.0
dtype: float64
-----
-----
banana    111.0
banana    111.0
dtype: float64
-----
True
-----
error
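
Instead of catching the exception, a Series also offers a dict-like `get` method that returns a default when the label is missing. A minimal sketch on a made-up `cart` Series:

```python
import pandas as pd

# Hypothetical data, for illustration only
cart = pd.Series([5.0, 7.0], index=['banana', 'orange'])

present = cart.get('orange')          # label exists: its value is returned
missing = cart.get('kiwi')            # label missing: None by default
with_default = cart.get('kiwi', 0.0)  # label missing: explicit default

print(present, missing, with_default)
```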


In [36]:

print(shoppingCart)
print("\n")
print(shoppingCart+shoppingCart)
print("\n")
print(shoppingCart*2)




banana       111.0
banana       111.0
orange         7.0
apple          3.0
pineapple      NaN
dtype: float64


banana       222.0
banana       222.0
orange        14.0
apple          6.0
pineapple      NaN
dtype: float64


banana       222.0
banana       222.0
orange        14.0
apple          6.0
pineapple      NaN
dtype: float64
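
Adding `shoppingCart` to itself, as above, does not exercise label alignment because both operands carry identical labels. The sketch below uses two made-up Series with partly different labels to show how the result covers the union of the labels, with NaN where a label is missing on one side:

```python
import pandas as pd

# Hypothetical data: labels 'b' and 'c' overlap, 'a' and 'd' do not
left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Values are matched by label, not by position
total = left + right

# Treat a label missing on one side as 0 instead of producing NaN
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```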


In [37]:
%%HTML
<h2>DataFrame</h2><br/>
<br/>
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:<br/>
Dict of 1D ndarrays, lists, dicts, or Series<br/>
2-D numpy.ndarray<br/>
Structured or record ndarray<br/>
A Series<br/>
Another DataFrame<br/>
<br/>
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.




In [63]:
myDict = {
    'clientName' : 'hugo',
    'shoppingCart' : shoppingCart,
    'shoppingCart2' : shoppingCart,
    'totalItems' : 1
}
# Both Series columns share shoppingCart's index; the scalar values are broadcast to every row.
# A nested plain dict did not work here, so the dict was converted to a Series first (shoppingCart).

df = pd.DataFrame(myDict)

print(df)
print("\n")
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# This line does not work, likely because shoppingCart's index has duplicate labels:
# print(pd.DataFrame(myDict, index=['banana']))

print(pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three']))


          clientName  shoppingCart  shoppingCart2  totalItems
banana          hugo         111.0          111.0           1
banana          hugo         111.0          111.0           1
orange          hugo           7.0            7.0           1
apple           hugo           3.0            3.0           1
pineapple       hugo           NaN            NaN           1


   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
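
For comparison, here is a sketch of what happens when the same dict `d` is passed without `index` or `columns`: the row labels become the union of the Series indexes, and selecting one column gives back a Series with NaN where that column had no value.

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# No index/columns passed: labels are derived from the input
full = pd.DataFrame(d)

row_labels = full.index.tolist()
column_labels = full.columns.tolist()

# A single column is a Series; 'one' has no value for row 'd'
column_one = full['one']

print(row_labels)
print(column_labels)
print(column_one)
```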


In [72]:
matrix = np.zeros(2)
zerosDf = pd.DataFrame(matrix, index=["one","two"], columns=['myCol'])
print(zerosDf)




     myCol
one    0.0
two    0.0
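
The cell above builds a one-column DataFrame from a 1-D array. A sketch of the 2-D ndarray case from the input list, assuming a made-up 2x3 array and hypothetical column names; the number of labels has to match the array shape:

```python
import numpy as np
import pandas as pd

# Hypothetical 2-D input: 2 rows, 3 columns
grid = np.zeros((2, 3))

gridDf = pd.DataFrame(grid, index=['one', 'two'], columns=['x', 'y', 'z'])
print(gridDf)

# Passing one column label for three columns is rejected
try:
    pd.DataFrame(grid, columns=['x'])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(mismatch_rejected)
```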


In [107]:
%%HTML
<h2>Stopped at:</h2> "From a list of dicts"
<a href="https://pandas.pydata.org/pandas-docs/stable/dsintro.html">tutorial</a>


In [110]:
%%HTML
<h2>Optional homework</h2>
For this initial homework we will be training with the gunshot deaths dataset. Get it here. The goal of this optional homework is to use an IPython Notebook to reproduce the results reported in the visualization at the top of the article (e.g., "nearly two-thirds of gun deaths are suicides"). It's not necessary to generate visualizations for the results -- numbers should be more than enough to convince yourself that you were able to reproduce the results of that article.

In [106]:

file = pd.read_csv("full_data.csv")

print(file.columns)
intent = file['intent']

# Exclude placeholder categories; rows with a missing intent (NaN) still pass this filter
gunshot_type = intent[(intent != 'None selected') & (intent != 'Unknown')]

print(file['intent'].value_counts())

suicides = gunshot_type[gunshot_type == 'Suicide']

# Share of suicides among the remaining records
total_suicides = len(suicides)
suicideRate = total_suicides / len(gunshot_type)

print(suicideRate)








Index(['Unnamed: 0', 'year', 'month', 'intent', 'police', 'sex', 'age', 'race',
       'hispanic', 'place', 'education'],
      dtype='object')
Suicide         63175
Homicide        35176
Accidental       1639
Undetermined      807
Name: intent, dtype: int64
0.6267485465981468
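
The same kind of share can be read directly from `value_counts(normalize=True)`. The sketch below runs on a tiny made-up DataFrame rather than on `full_data.csv`, so its numbers are illustrative only; it also shows how a missing intent changes the denominator.

```python
import pandas as pd

# Toy data invented for illustration; not taken from full_data.csv
toy = pd.DataFrame({
    'intent': ['Suicide', 'Suicide', 'Homicide', None, 'Accidental']
})

# Missing values are dropped by default, so shares are computed over 4 rows
shares = toy['intent'].value_counts(normalize=True)

# Keep missing values in the denominator: shares are computed over 5 rows
shares_with_missing = toy['intent'].value_counts(normalize=True, dropna=False)

print(shares)
print(shares_with_missing)
```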
