##### Data Structures in Pandas Library
Pandas provides two main data structures for manipulating data:
- Series
- DataFrame
###### Pandas Series
- A pandas Series is a one-dimensional labeled array (comparable to a single column in an Excel sheet) capable of holding data of any type, e.g. integers, floats, strings.
- The axis labels are collectively called the index.
###### Creating a Series
- A pandas Series can be created by loading data from existing storage (a database, a CSV file or an Excel file). It can also be created from lists, dictionaries, scalar values, etc.
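
As a minimal sketch of the in-memory options listed above (the variable names and sample values are illustrative and not taken from any dataset):

```python
import pandas as pd

# From a list: pandas assigns a default integer index 0, 1, 2, ...
from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'a': 1.5, 'b': 2.5, 'c': 3.5})

# From a scalar: the value is repeated for every label in the given index
from_scalar = pd.Series(7, index=['x', 'y', 'z'])

print(from_list)
print(from_dict)
print(from_scalar)
```
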
###### Pandas DataFrame
- A pandas DataFrame is a two-dimensional data structure with labeled axes (rows and columns).
###### Creating DataFrame
- A pandas DataFrame can be created by loading a dataset from existing storage (a database, a CSV file or an Excel file). It can also be created from lists, dictionaries, a list of dictionaries, etc.
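
The in-memory constructions mentioned above might look like the following sketch. The sample names and ages are illustrative only:

```python
import pandas as pd

# From a dictionary of lists: each key becomes a column name
from_dict = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Age': [27, 24]})

# From a list of dictionaries: each dictionary becomes one row
from_records = pd.DataFrame([
    {'Name': 'Jai', 'Age': 27},
    {'Name': 'Princi', 'Age': 24},
])

# From a list of lists: column names are supplied separately
from_lists = pd.DataFrame([['Jai', 27], ['Princi', 24]], columns=['Name', 'Age'])

print(from_dict)
print(from_records)
print(from_lists)
```

All three calls should produce a frame with the same two rows and the same two columns.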

##### Methods
###### describe() method
The pandas **describe()** method is used to view basic statistical details, such as percentiles, mean and standard deviation, of a Series or of the columns of a DataFrame.
- **Syntax**: DataFrame.describe(percentiles = None, include = None, exclude = None)
- **Parameters**:
   * **percentiles**: list-like of numbers between 0 and 1; the percentiles to include in the output. Default is None, which returns the 25th, 50th and 75th percentiles.
   * **include**: list of data types to include when describing a DataFrame. Default is None.
   * **exclude**: list of data types to exclude when describing a DataFrame. Default is None.

- **Return type**: Statistical summary of the Series or DataFrame.


In [2]:
# Example
import pandas as pd
data = pd.read_csv('nba.csv')
print(data.head())

            Name            Team  Number Position   Age Height  Weight  \
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0   
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0   
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0   
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0   
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0   

             College     Salary  
0              Texas  7730337.0  
1          Marquette  6796117.0  
2  Boston University        NaN  
3      Georgia State  1148640.0  
4                NaN  5000000.0  


In [8]:
df = data['Age'].head()
df.unique()    # To print all the unique values

array([25., 27., 22., 29.])

In [10]:
df.nunique()  # To give the total count of unique values

4

In [15]:
# Using Describe function
print(data.describe()) # It gives several statistical measures, including mean, std, quartiles etc

           Number         Age      Weight        Salary
count  457.000000  457.000000  457.000000  4.460000e+02
mean    17.678337   26.938731  221.522976  4.842684e+06
std     15.966090    4.404016   26.368343  5.229238e+06
min      0.000000   19.000000  161.000000  3.088800e+04
25%      5.000000   24.000000  200.000000  1.044792e+06
50%     13.000000   26.000000  220.000000  2.839073e+06
75%     25.000000   30.000000  240.000000  6.500000e+06
max     99.000000   40.000000  307.000000  2.500000e+07


##### Explanation of the description of numerical columns
- **count**: Number of non-empty (non-null) values
- **mean**: Mean of the column values
- **std**: Standard deviation of the column values
- **min**: Minimum value of the column
- **25%**: 25th percentile
- **50%**: 50th percentile (the median)
- **75%**: 75th percentile
- **max**: Maximum value of the column
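
To make the rows above concrete, the sketch below compares a `describe()` summary with the corresponding standalone Series methods. The `ages` Series is a hypothetical sample with one missing value, not the NBA data:

```python
import pandas as pd

# Hypothetical sample containing one missing value
ages = pd.Series([25.0, 27.0, 22.0, 29.0, None], name='Age')

summary = ages.describe()
print(summary)

# Each row of the summary has a standalone counterpart
print(ages.count())          # non-null values only, so the missing entry is skipped
print(ages.mean())
print(ages.std())            # sample standard deviation
print(ages.min())
print(ages.quantile(0.25))   # 25th percentile
print(ages.median())         # 50th percentile
print(ages.quantile(0.75))   # 75th percentile
print(ages.max())
```
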

In [17]:
# pandas describe() behavior for numeric and object (string) dtypes
import pandas as pd
data = pd.read_csv('nba.csv')

# removing null values to avoid errors
data.dropna(inplace = True)

# percentile list
percentile = [.20, .40, .60, .80]

# list of dtypes to include
include = ['object', 'float', 'int']

# calling describe method
desc = data.describe(percentiles = percentile, include = include)

#display
desc   # For the columns with strings, NaN was returned for numeric operations


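|  | Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|---|---|---|---|---|---|---|---|---|---|
| count | 364 | 364 | 364.0 | 364 | 364.0 | 364 | 364.0 | 364 | 364.0 |
| unique | 364 | 30 | NaN | 5 | NaN | 17 | NaN | 115 | NaN |
| top | Avery Bradley | New Orleans Pelicans | NaN | SG | NaN | 6-9 | NaN | Kentucky | NaN |
| freq | 1 | 16 | NaN | 87 | NaN | 49 | NaN | 22 | NaN |
| mean | NaN | NaN | 16.82967 | NaN | 26.615385 | NaN | 219.785714 | NaN | 4620311.0 |
| std | NaN | NaN | 14.994162 | NaN | 4.233591 | NaN | 24.793099 | NaN | 5119716.0 |
| min | NaN | NaN | 0.0 | NaN | 19.0 | NaN | 161.0 | NaN | 55722.0 |
| 20% | NaN | NaN | 4.0 | NaN | 23.0 | NaN | 195.0 | NaN | 947276.0 |
| 40% | NaN | NaN | 9.0 | NaN | 25.0 | NaN | 212.0 | NaN | 1638754.0 |
| 50% | NaN | NaN | 12.0 | NaN | 26.0 | NaN | 220.0 | NaN | 2515440.0 |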


In [18]:
# Describing series of strings
import pandas as pd

# making data frame
data = pd.read_csv('nba.csv')

# removing null values to avoid errors
data.dropna(inplace = True)

# Calling describe method
desc = data['Name'].describe()

# display
desc

count               364
unique              364
top       Avery Bradley
freq                  1
Name: Name, dtype: object

##### Dealing with Rows and Columns in Pandas DataFrame
We can perform basic operations on rows/columns like selecting, deleting, adding and renaming.
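
A minimal sketch of these four operations on columns, plus deleting a row, using a small illustrative frame (the `City` column and its values are assumptions made for this example):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Age': [27, 24]})

# Selecting: a single column by name
ages = df['Age']

# Adding: assign a list whose length matches the number of rows
df['City'] = ['Delhi', 'Kanpur']

# Renaming: rename() returns a new DataFrame unless inplace=True is passed
df = df.rename(columns={'City': 'Address'})

# Deleting: drop() with columns= removes columns, with index= removes rows
without_age = df.drop(columns=['Age'])
without_first_row = df.drop(index=[0])

print(without_age)
print(without_first_row)
```
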
###### Column Selection
To select a column in a pandas DataFrame, we can access it by its column name. Passing a list of column names selects several columns at once.

In [20]:
# Example
import pandas as pd

data = {
        'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']
       }
df = pd.DataFrame(data)

print(df[['Name','Qualification']])





     Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd
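
One detail worth illustrating: the type of the result depends on whether a single label or a list of labels is passed. The sketch below reuses part of the sample data from the example above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
    'Age': [27, 24, 22, 32],
})

one_column = df['Name']    # a single label returns a Series
as_frame = df[['Name']]    # a list of labels returns a DataFrame, even with one entry

print(type(one_column).__name__)
print(type(as_frame).__name__)
```
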


###### Other methods of selecting multiple columns in a DataFrame
- using the basic method described above
- using loc[]
- using iloc[]
- using .ix (deprecated in pandas 0.20 and removed in pandas 1.0; use loc[] or iloc[] instead)
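
As a rough side-by-side sketch of the approaches that are still available in current pandas, the three selections below should return the same two columns. The sample data matches the other examples in this section:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
    'Age': [27, 24, 22, 32],
    'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
    'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
})

by_name = df[['Age', 'Address']]            # basic method: list of column names
by_label = df.loc[:, ['Age', 'Address']]    # loc[]: all rows, columns by label
by_position = df.iloc[:, [1, 2]]            # iloc[]: all rows, columns by integer position

print(by_name.equals(by_label))
print(by_label.equals(by_position))
```
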

In [26]:
# Another example of using basic method
import pandas as pd

data = {
        'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']
        }
df = pd.DataFrame(data)
df[df.columns[1:4]] # using DataFrame slicing

|  | Age | Address | Qualification |
|---|---|---|---|
| 0 | 27 | Delhi | Msc |
| 1 | 24 | Kanpur | MA |
| 2 | 22 | Allahabad | MCA |
| 3 | 32 | Kannauj | Phd |


##### Select multiple columns in a pandas DataFrame using loc[]


In [48]:
# Example
import pandas as pd

data = {
        'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']
        }
df = pd.DataFrame(data)

# select three rows and two columns                
df.loc[1:3,['Name', 'Address']]


|  | Name | Address |
|---|---|---|
| 1 | Princi | Kanpur |
| 2 | Gaurav | Allahabad |
| 3 | Anuj | Kannauj |
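
The slice `1:3` returns three rows here because `loc[]` slices by label and includes the end label. A short sketch contrasting this with position-based slicing in `iloc[]`, which excludes the end position (same sample data as above):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
    'Age': [27, 24, 22, 32],
    'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
    'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
})

label_slice = df.loc[1:3, ['Name', 'Address']]    # labels 1, 2 and 3
position_slice = df.iloc[1:3, [0, 2]]             # positions 1 and 2 only

print(len(label_slice), len(position_slice))
```
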


In [38]:
df.loc[0:3]  # rows labeled 0 through 3, all columns

|  | Name | Age | Address | Qualification |
|---|---|---|---|---|
| 0 | Jai | 27 | Delhi | Msc |
| 1 | Princi | 24 | Kanpur | MA |
| 2 | Gaurav | 22 | Allahabad | MCA |
| 3 | Anuj | 32 | Kannauj | Phd |


##### Select a range of columns, from one column to another


In [39]:
# In this example, loc[] is used to select the columns from 'Name' through 'Address' (both included)
import pandas as pd

data = {
        'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']
        }
df = pd.DataFrame(data)

df.loc[0:1, 'Name':'Address']

|  | Name | Age | Address |
|---|---|---|---|
| 0 | Jai | 27 | Delhi |
| 1 | Princi | 24 | Kanpur |


In [49]:
# selecting the first row and all columns
df.loc[0, :]

Name               Jai
Age                 27
Address          Delhi
Qualification      Msc
Name: 0, dtype: object

##### Select multiple columns in pandas using iloc[]