<a href="https://colab.research.google.com/github/SurajKande/python-pandas/blob/master/python_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# pandas Basics

Pandas is an open source library, providing high-performance, easy-to-use data structures and data analysis tools for Python.

The DataFrame is one of Pandas' most important data structures. It is a way to store tabular data in which both the rows and the columns can be labeled. One way to build a DataFrame is from a dictionary.

In [0]:
# Pre-defined lists

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]


# Import pandas as pd
import pandas as pd

# Create dictionary my_dict with three key:value pairs: my_dict
my_dict={'country':names,
          'drives_right':dr,
          'cars_per_cap':cpc }

# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)

# Specify row labels of cars
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']
cars.index = row_labels
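
As a small variation on the cell above, the row labels can also be supplied when the DataFrame is built, through the `index` argument of `pd.DataFrame`. The sketch below uses a shortened copy of the same data; the name `cars_small` is only for illustration.

```python
import pandas as pd

# Shortened copy of the data above (three countries only, for illustration).
data = {
    "country": ["United States", "Australia", "Japan"],
    "drives_right": [True, False, False],
    "cars_per_cap": [809, 731, 588],
}

# Passing index= sets the row labels at construction time,
# instead of assigning to .index afterwards.
cars_small = pd.DataFrame(data, index=["US", "AUS", "JPN"])
print(cars_small)
```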

Putting data in a dictionary and then building a DataFrame works, but it's not very efficient. What if you're dealing with millions of observations? In those cases, the data is typically available as files with a regular structure. One of those file types is the CSV file.

To import CSV data into Python as a Pandas DataFrame, you can use read_csv().

In [0]:
# Read data from a CSV file
df = pd.read_csv('cars.csv')

index_col is an argument of read_csv() that specifies which column
in the CSV file should be used as the row labels.

In [0]:
df = pd.read_csv('cars.csv',index_col=0)
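
To see what index_col changes without needing the file, here is a minimal sketch that reads the same CSV text twice. The CSV content is a made-up stand-in for cars.csv (its real layout is assumed, not known), fed in through `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical stand-in for cars.csv: the first column holds the row labels.
csv_text = """,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JPN,588,Japan,False
"""

# Without index_col, pandas numbers the rows 0, 1, 2, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row labels.
labeled_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.index.tolist())
print(labeled_index.index.tolist())
```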

We can index and select Pandas DataFrames in many different ways.
The simplest way is to use square brackets.

In [0]:
# Print out country column as Pandas Series
print(df["country"])

# Print out country column as Pandas DataFrame
print(df[["country"]])

# Print out DataFrame with country and drives_right columns
print(df[["country", "drives_right"]])
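
The difference between single and double brackets is easy to check by looking at the type and shape of the result. A minimal sketch on a two-row frame (the small `cars` frame here is illustrative):

```python
import pandas as pd

cars = pd.DataFrame(
    {"country": ["Japan", "India"], "drives_right": [False, False]},
    index=["JPN", "IN"],
)

as_series = cars["country"]    # single brackets: one column as a Series
as_frame = cars[["country"]]   # double brackets: a one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```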

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.).

The axis labels are collectively called the index. A Series is comparable to a single column in an Excel sheet.
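
A minimal sketch of a Series built directly, using three of the cars_per_cap values from earlier. It shows the index labels alongside label-based and position-based access.

```python
import pandas as pd

# A Series is the values plus an index of labels.
cars_per_cap = pd.Series(
    [809, 731, 588],
    index=["US", "AUS", "JPN"],
    name="cars_per_cap",
)

print(cars_per_cap.index.tolist())  # the axis labels
print(cars_per_cap["AUS"])          # access by label
print(cars_per_cap.iloc[0])         # access by integer position
```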

In [0]:
# Print out first 3 observations
print(df[0:3])

# Print out fourth, fifth and sixth observation
print(df[3:6])

In [0]:
# to read first 5 rows of a dataframe
df.head()

In [0]:
# to read last 5 rows of a dataframe
df.tail()

In [0]:
# Print the column labels (the columns Index)
df.columns

In [0]:
#to know the number of rows and columns of the data
df.shape

In [0]:
# Summarize the DataFrame: column dtypes, non-null counts and memory usage
df.info()

In [0]:
# Remove rows that match a condition
# Signature: df.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

# Example (assumes df has a 'Duration' column, which the cars data does not)
remove_indexes = df[df['Duration'] == '2016-17'].index   # row labels where Duration == '2016-17'
df.drop(remove_indexes, inplace=True)
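
Since the cars data has no Duration column, here is a self-contained sketch of the same idea on a small made-up frame (the `records` frame and its values are hypothetical). It also shows that keeping the rows where the condition is false gives the same result as dropping the matching rows.

```python
import pandas as pd

# Hypothetical data, only to make the example runnable.
records = pd.DataFrame({
    "Duration": ["2015-16", "2016-17", "2017-18", "2016-17"],
    "value": [10, 20, 30, 40],
})

# Option 1: collect the labels of the matching rows, then drop them.
to_remove = records[records["Duration"] == "2016-17"].index
dropped = records.drop(to_remove)

# Option 2: keep only the rows where the condition is False.
filtered = records[records["Duration"] != "2016-17"]

print(dropped)
print(dropped.equals(filtered))
```

Both options return a new DataFrame and leave `records` unchanged, because `inplace=True` is not used here.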

Other ways to access the data in a DataFrame are **loc and iloc**.


With loc and iloc you can do practically any data selection operation on DataFrames you can think of.

 **loc** is label-based, which means that you have to specify rows and columns based on their row and column labels. 
 
 **iloc** is integer-position based, so you have to specify rows and columns by their integer position.

In [0]:
# Print out observation for Japan
print(df.loc['JPN'])    #using loc to print 
print(df.iloc[2])       #using iloc to print 

# Print out observations for Australia and Egypt
print(df.loc[["AUS","EG"]])
print(df.iloc[[1,6]])

# Print out drives_right value of Morocco
print(df.loc["MOR", "drives_right"])   # row and column label in a single loc call

# Print sub-DataFrame
print(df.loc[["RU", "MOR"], ["country", "drives_right"]])


# Print out drives_right column as Series
print(df["drives_right"])

# Print out drives_right column as DataFrame
print(df[["drives_right"]])

# Print out cars_per_cap and drives_right as DataFrame
print(df[["cars_per_cap", "drives_right"]])
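
One difference between loc and iloc that is easy to miss is how slices treat the end point: a label slice with loc includes the end label, while a position slice with iloc excludes the end position. A minimal sketch on a rebuilt copy of the cars data:

```python
import pandas as pd

cars = pd.DataFrame(
    {
        "country": ["United States", "Australia", "Japan", "India"],
        "cars_per_cap": [809, 731, 588, 18],
    },
    index=["US", "AUS", "JPN", "IN"],
)

# Label slice: both end points are included (AUS, JPN, IN).
by_label = cars.loc["AUS":"IN"]

# Position slice: the end position is excluded (positions 1 and 2 only).
by_position = cars.iloc[1:3]

print(by_label.index.tolist())
print(by_position.index.tolist())
```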

In [0]:
# Extract drives_right column as Series
dr = df['drives_right']

# Use dr to subset the DataFrame
sel = df[dr]
# or, as a one-liner without the intermediate variable
sel = df[df['drives_right']]
#OUTPUT:
#              cars_per_cap         country    drives_right
#          US            809   United States          True
#          RU            200          Russia          True
#          MOR            70         Morocco          True
#          EG             45           Egypt          True

# Create cars_maniac: observations that have a cars_per_cap over 500
cars_maniac = df[df["cars_per_cap"] > 500]

np.logical_and(), np.logical_or() and np.logical_not() are
the NumPy variants of the and, or and not operators.

You can also use them on Pandas Series to do more advanced filtering operations.

In [0]:

import numpy as np

# Create medium: observations with cars_per_cap between 100 and 500
medium = df[np.logical_and(df["cars_per_cap"] > 100, df["cars_per_cap"] < 500)]
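
The same filter can also be written with the element-wise operators `&`, `|` and `~` directly on the boolean Series. The sketch below compares both spellings on a rebuilt copy of the cars_per_cap column. Note that each comparison needs its own parentheses, because `&` binds more tightly than `>` and `<`.

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame(
    {"cars_per_cap": [809, 731, 588, 18, 200, 70, 45]},
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)
cpc = cars["cars_per_cap"]

# NumPy function form, as in the cell above.
medium_np = cars[np.logical_and(cpc > 100, cpc < 500)]

# Operator form: & is element-wise "and" on boolean Series.
medium_op = cars[(cpc > 100) & (cpc < 500)]

print(medium_op)
```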

Looping over a DataFrame

Iterating over a Pandas DataFrame is typically done with the iterrows() method. Used in a for loop, every observation is iterated over and on every iteration the row label and actual row contents are available:

```
for lab, row in df.iterrows():
    ...
```

The row data that's generated by iterrows() on every run is a Pandas Series.

In [0]:
# Iterate over rows of cars
for label,row_content in df.iterrows():
    print(label)
    print(row_content)
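
Because each row arrives as a Series, individual fields can be read by column label inside the loop. A minimal sketch on a two-row frame (the frame and the `lines` list are illustrative):

```python
import pandas as pd

cars = pd.DataFrame(
    {"country": ["Japan", "India"], "cars_per_cap": [588, 18]},
    index=["JPN", "IN"],
)

lines = []
for label, row in cars.iterrows():
    # row is a Series indexed by the column names.
    lines.append(f"{label}: {row['cars_per_cap']}")

print(lines)
```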

If you want to add a column to a DataFrame by calling a function on another
column, use apply() instead of looping over the entire DataFrame.

In [0]:
# Use .apply(str.upper)
df["COUNTRY"] = df["country"].apply(str.upper)
print(df[["country","COUNTRY"]])
#OUTPUT:
    #           country        COUNTRY
    # US   United States  UNITED STATES
    # AUS      Australia      AUSTRALIA
    # JPN          Japan          JAPAN
    # IN           India          INDIA
    # RU          Russia         RUSSIA
    # MOR        Morocco        MOROCCO
    # EG           Egypt          EGYPT
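
For string columns specifically, pandas also provides vectorized string methods through the `.str` accessor, which can stand in for `apply(str.upper)`. A minimal sketch comparing the two on a small frame (the column names `COUNTRY_apply` and `COUNTRY_str` are only for illustration):

```python
import pandas as pd

cars = pd.DataFrame(
    {"country": ["United States", "Japan", "Egypt"]},
    index=["US", "JPN", "EG"],
)

# apply() calls str.upper once per element.
cars["COUNTRY_apply"] = cars["country"].apply(str.upper)

# The .str accessor applies the same operation to the whole column.
cars["COUNTRY_str"] = cars["country"].str.upper()

print(cars)
```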

In [0]:
# Example on a tweets.csv dataset

## Q: Count the languages used in the tweets

# Consider the Twitter dataset in tweets.csv
# Import pandas
import pandas as pd
df = pd.read_csv('tweets.csv')

# Initialize an empty dictionary: langs_count
langs_count = {}

col = df['lang']
for entry in col:
    # If the language is in langs_count, add 1 
    if entry in langs_count:
        langs_count[entry]+=1
    # Else add the language to langs_count, set the value to 1
    else:
        langs_count[entry] = 1

# Print the first 5 rows to check the column labels
print(df.head())
# Print the populated dictionary
print(langs_count)
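
The manual counting loop above is a useful exercise. As an alternative sketch, `Series.value_counts()` produces the same counts in one call. The tiny `tweets` frame below is made-up data standing in for tweets.csv.

```python
import pandas as pd

# Hypothetical stand-in for the 'lang' column of tweets.csv.
tweets = pd.DataFrame({"lang": ["en", "en", "et", "en", "und"]})

# value_counts() counts how often each distinct value occurs.
langs_count = tweets["lang"].value_counts().to_dict()

print(langs_count)
```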

In [0]:
# To read only the first 5 rows of the file, use nrows; header=None tells pandas the file has no header row
df = pd.read_csv('tweets.csv', nrows=5, header=None)
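
A minimal sketch of how nrows and header interact, using made-up CSV text in place of the real file. With header=None, a header line that is present in the file is read as an ordinary data row and the columns are numbered instead.

```python
import io
import pandas as pd

# Hypothetical CSV text with a header line and three data rows.
csv_text = "id,lang\n1,en\n2,et\n3,en\n"

# Default: the first line supplies the column names; nrows=2 reads two data rows.
with_header = pd.read_csv(io.StringIO(csv_text), nrows=2)

# header=None: columns are numbered 0, 1, ... and the first line counts as data.
no_header = pd.read_csv(io.StringIO(csv_text), nrows=2, header=None)

print(with_header.columns.tolist())
print(no_header.columns.tolist())
print(no_header.iloc[0].tolist())
```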