In [None]:
import numpy as np
import pandas as pd

In [None]:
# we will use the wine dataset to analyze and learn basics of pandas
wine_reviews = pd.read_csv("../input/wine-reviews-dataset/winemag-data_first150k.csv")
wine_reviews.head(5)

Now we can see that the data already carries its own index column, so we don't need a second one. We will drop the extra index by telling pandas to use the first column as the index, using the command below:



In [None]:
wine_reviews = pd.read_csv("../input/wine-reviews-dataset/winemag-data_first150k.csv", index_col=0)
wine_reviews.head(5)

In [None]:
# now we will see how to extract columns
wine_reviews.price

Indexing in pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you're supposed to be using.

Index-based selection
Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

In [None]:
wine_reviews.iloc[0]

Both loc and iloc are row-first, column-second. This is the opposite of the plain indexing operator on a DataFrame, which is column-first, row-second (wine_reviews['country'][0] picks the column, then the row).

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with iloc, we can do the following:



In [None]:
wine_reviews.iloc[:,0]

In [None]:
wine_reviews.iloc[3,2]

On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the first two columns from just the first ten rows, we would do:

In [None]:
wine_reviews.iloc[:10,:2]

In [None]:
wine_pp = wine_reviews.iloc[:,2:5]
wine_pp.head(5)

In [None]:
# now I want to retrieve the country, designation and province columns for all records
# (with index_col=0 these sit at column positions 0, 2 and 5; position 4 is price)
wine_reviews.iloc[:, [0, 2, 5]]
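
The iloc patterns used above can be tried on a small table where the answers are easy to check by eye. This is only a sketch: the toy DataFrame and its values are made up for illustration and are not taken from the wine dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for wine_reviews
toy = pd.DataFrame(
    {
        "country": ["US", "Spain", "US", "Italy"],
        "points": [96, 96, 95, 90],
        "price": [235.0, 110.0, 90.0, 20.0],
    }
)

first_row = toy.iloc[0]        # Series holding the first row
first_col = toy.iloc[:, 0]     # Series holding the first column
cell = toy.iloc[3, 2]          # single value: row position 3, column position 2
block = toy.iloc[:2, :2]       # first two rows, first two columns
picked = toy.iloc[:, [0, 2]]   # all rows, columns at positions 0 and 2

print(block)
print(picked.columns.tolist())
```

Note that every selector here is a position, so the code keeps working even if the columns are renamed, but silently selects something else if they are reordered.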

Label-based selection
The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it's the data index value, not its position, which matters.

For example, to get the country and price of the first entry in wine_reviews, we would now do the following:

In [None]:
wine_reviews.loc[0,['country', 'price']]

In [None]:
wine_reviews.loc[:,['country', 'price','points']]

When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].
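
The difference is easy to confirm on a throwaway DataFrame. The sketch below assumes a default 0,...,1000 integer index and a small, made-up string-indexed table; neither comes from the wine dataset.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..1000
df = pd.DataFrame({"value": range(1001)})

n_iloc = len(df.iloc[0:1000])    # end position excluded
n_loc = len(df.loc[0:1000])      # end label included
n_loc_999 = len(df.loc[0:999])   # one label lower gives 1000 rows again

# Hypothetical string index, sorted alphabetically
produce = pd.DataFrame(
    {"stock": [3, 5, 2, 7]},
    index=["Apples", "Bananas", "Potatoes", "Quinces"],
)
between = produce.loc["Apples":"Potatoes"]   # both end labels are included

print(n_iloc, n_loc, n_loc_999)
print(between.index.tolist())
```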

Otherwise, the semantics of using loc are the same as those for iloc.

So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.

For example, suppose that we're interested specifically in better-than-average wines produced in the US.

We can start by selecting only the wines that come from the US:

In [None]:
wine_reviews.loc[wine_reviews.country == 'US']

In [None]:
wine_reviews.loc[(wine_reviews.country == 'US')  & (wine_reviews.price > 30)]

In [None]:
# We want to find the wines of USA which are better than average
wine_reviews.loc[(wine_reviews.country == 'US') & (wine_reviews.points > wine_reviews.points.mean(axis=0))]

In [None]:
wine_reviews.points.mean(axis=0)

Pandas comes with a few built-in conditional selectors, one of which we will highlight here.

isin lets you select data whose value "is in" a list of values. For example, here's how we can use it to select wines from the US or Spain that cost more than 100:

In [None]:
wine_reviews.loc[(wine_reviews.country.isin(['US','Spain'])) & (wine_reviews.price > 100)]
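
It can help to see that each condition is just a boolean Series, and that & combines them row by row. The following is a sketch on made-up data; the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for wine_reviews
toy = pd.DataFrame(
    {
        "country": ["US", "Spain", "Italy", "US"],
        "price": [235.0, 110.0, 135.0, 20.0],
    }
)

is_us_or_spain = toy.country.isin(["US", "Spain"])   # True, True, False, True
is_expensive = toy.price > 100                       # True, True, True, False

# & is the element-wise "and"; each condition needs its own parentheses when
# written inline, because & binds more tightly than == or >.
selected = toy.loc[is_us_or_spain & is_expensive]

print(selected)
```

Naming the masks first is optional, but it makes long conditions easier to read and reuse.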

# Summary Functions and Maps in Pandas

In [None]:
# the first function is the describe function to get an idea about the column

wine_reviews.price.describe()

In [None]:
wine_reviews.country.describe()

To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method:

In [None]:
wine_reviews.country.value_counts()
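
For a text column, describe() and value_counts() are closely related: the "top" and "freq" entries of describe() are simply the first row of value_counts(). A small sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical country column
country = pd.Series(["US", "US", "Spain", "Italy", "US", "Spain"], name="country")

counts = country.value_counts()   # most frequent value first
summary = country.describe()      # count, unique, top, freq for text data

print(counts)
print(summary)
```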

Maps
A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often.

map() is the first, and slightly simpler one. For example, suppose that we wanted to remean the prices of the wines to 0. We can do this as follows:

In [None]:
# we will subtract the mean price from the price of every wine
wine_price = wine_reviews.price.mean()
print("The mean price of wine is:", wine_price)
wine_reviews.price.map(lambda p: p - wine_price)
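
For simple arithmetic like this, the same result can also be obtained by subtracting the mean from the whole Series at once. The sketch below, on made-up prices, checks that both routes agree.

```python
import pandas as pd

# Hypothetical prices
price = pd.Series([10.0, 20.0, 30.0], name="price")
mean_price = price.mean()

via_map = price.map(lambda p: p - mean_price)   # function applied value by value
via_arithmetic = price - mean_price             # same subtraction, vectorized

print(via_map.tolist())
print(via_map.equals(via_arithmetic))
```

map() remains the more general tool, since the function passed to it can do anything with each value, not just arithmetic.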

Groupwise analysis
One function we've been using heavily thus far is the value_counts() function. We can replicate what value_counts() does by doing the following:

In [None]:
wine_reviews.groupby('price').price.count()
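
The two approaches give the same counts, but not in the same order: groupby() returns the groups sorted by their key, while value_counts() sorts by frequency. A sketch on made-up prices, sorting the value_counts() result by index to line the two up:

```python
import pandas as pd

# Hypothetical prices
toy = pd.DataFrame({"price": [10.0, 20.0, 10.0, 30.0, 10.0, 20.0]})

by_group = toy.groupby("price").price.count()       # ordered by price
by_counts = toy.price.value_counts().sort_index()   # reordered by price to match

print(by_group)
print(by_counts)
```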

In [None]:
# finding the cheapest price of the wine in each country
wine_reviews.groupby('country').price.min()

In [None]:
# for each (country, province) pair, keep the row of the highest-scoring wine
wine_reviews.groupby(['country','province']).apply(lambda df: df.loc[df.points.idxmax()])

Another groupby() method worth mentioning is agg(), which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:

In [None]:
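# 'min' and 'max' are passed as strings; recent pandas versions warn when the Python builtins are passed instead
wine_reviews.groupby(['country']).price.agg([len, 'min', 'max', 'mean', 'sum', 'count']).sort_values(['max', 'sum'], ascending=False)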
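
One detail in a summary like this is that counting rows and counting values are not the same thing when prices are missing. The sketch below uses made-up data with one missing price, and uses the string names "size" and "count" to show the difference.

```python
import pandas as pd

# Hypothetical data; one Spanish wine has no price
toy = pd.DataFrame(
    {
        "country": ["US", "US", "Spain", "Spain", "Italy"],
        "price": [235.0, 90.0, 110.0, None, 20.0],
    }
)

summary = toy.groupby("country").price.agg(["size", "count", "min", "max", "mean"])
ranked = summary.sort_values("max", ascending=False)

print(ranked)
```

Here "size" counts every row in the group, while "count" skips the missing price.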

Multi-indexes
In all of the examples we've seen thus far we've been working with DataFrame or Series objects with a single-label index. groupby() is slightly different in the fact that, depending on the operation we run, it will sometimes result in what is called a multi-index.
Multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices. They also require two levels of labels to retrieve a value. Dealing with multi-index output is a common "gotcha" for users new to pandas.

The use cases for a multi-index are detailed alongside instructions on using them in the MultiIndex / Advanced Selection section of the pandas documentation.

However, in general the multi-index method you will use most often is the one for converting back to a regular index, the reset_index() method:

In [None]:
countries_reviewed = wine_reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed.reset_index()
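
A small sketch of what the multi-index looks like before and after reset_index(), using made-up rows in place of the wine data:

```python
import pandas as pd

# Hypothetical toy data standing in for wine_reviews
toy = pd.DataFrame(
    {
        "country": ["US", "US", "US", "Italy"],
        "province": ["California", "California", "Oregon", "Tuscany"],
        "description": ["a", "b", "c", "d"],
    }
)

grouped = toy.groupby(["country", "province"]).description.agg([len])

# With a multi-index, a row is addressed by a tuple with one label per level
n_california = grouped.loc[("US", "California"), "len"]

# reset_index() turns the index levels back into ordinary columns
flat = grouped.reset_index()

print(grouped.index.nlevels)
print(flat.columns.tolist())
```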

Sorting
Looking again at countries_reviewed we can see that grouping returns data in index order, not in value order. That is to say, when outputting the result of a groupby, the order of the rows is dependent on the values in the index, not in the data.

To get data in the order we want it in, we can sort it ourselves. The sort_values() method is handy for this:

In [None]:
countries_reviewed.sort_values('len')
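
sort_values() sorts in ascending order by default. A sketch of the most common variations, on a made-up table with a len column like the one above:

```python
import pandas as pd

# Hypothetical review counts
toy = pd.DataFrame(
    {
        "country": ["US", "Italy", "Spain", "France"],
        "len": [3, 1, 3, 2],
    }
)

ascending = toy.sort_values("len")                     # smallest count first
descending = toy.sort_values("len", ascending=False)   # largest count first

# Sort by several columns at once: count descending, ties broken by country name
by_two = toy.sort_values(["len", "country"], ascending=[False, True])

# sort_index() restores the original index order
restored = descending.sort_index()

print(by_two)
```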

It's possible to convert a column of one type into another wherever such a conversion makes sense by using the astype() function:

In [None]:
wine_reviews.points.astype('float')

Entries missing values are given the value NaN, short for "Not a Number". For technical reasons these NaN values are always of the float64 dtype.
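
One visible consequence is that an integer column turns into a float column as soon as it contains a missing value. A sketch on made-up scores:

```python
import pandas as pd

# Hypothetical scores
complete = pd.Series([87, 90, 95])       # no gaps: stored as integers
with_gap = pd.Series([87, None, 95])     # the gap becomes NaN, forcing float64

as_float = complete.astype("float64")    # explicit conversion with astype()

print(complete.dtype, with_gap.dtype, as_float.dtype)
print(with_gap.isnull().sum())
```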

Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull() (or its companion pd.notnull()). It is used as follows:

In [None]:
wine_reviews[pd.isnull(wine_reviews.country)]

In [None]:
wine_reviews['region_2'] = wine_reviews.region_2.fillna('Unknown')

In [None]:
# checking to see if all the null values for region_2 have gone

wine_reviews[pd.isnull(wine_reviews.region_2)]

In [None]:
wine_reviews['country'] = wine_reviews.country.fillna('Unknown')

In [None]:
# checking to see if all the null values for country have gone

wine_reviews[pd.isnull(wine_reviews.country)]

In [None]:
wine_reviews['country'] = wine_reviews.country.replace('Unknown', 'Invalid')

In [None]:
wine_reviews.loc[wine_reviews.country == 'Invalid']
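
The whole missing-data sequence above (find the gaps, fill them, then replace the placeholder) can be followed end to end on a tiny made-up column:

```python
import pandas as pd

# Hypothetical country column with one missing entry
toy = pd.DataFrame({"country": ["US", None, "Spain"]})

missing_before = int(pd.isnull(toy.country).sum())   # number of gaps to start with

# fillna() returns a new Series, so we assign it back to the column
toy["country"] = toy.country.fillna("Unknown")
missing_after = int(pd.isnull(toy.country).sum())

# replace() swaps one ordinary value for another
toy["country"] = toy.country.replace("Unknown", "Invalid")

print(missing_before, missing_after)
print(toy.country.tolist())
```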

The first function we'll introduce here is rename(), which lets you change index names and/or column names. For example, to change the points column in our dataset to score, we would do:

In [None]:
wine_reviews.rename(columns = {'points':'score'})
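
Keep in mind that rename() returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name if we want to keep it. A sketch on made-up data, also showing that index labels can be renamed the same way:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"points": [87, 90], "price": [20.0, 35.0]})

renamed = toy.rename(columns={"points": "score"})        # new DataFrame
relabeled = toy.rename(index={0: "first", 1: "second"})  # rename row labels instead

print(toy.columns.tolist())       # the original still has "points"
print(renamed.columns.tolist())
print(relabeled.index.tolist())
```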