
## Summary functions
Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way.

**`describe` method**

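This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. For numerical data it reports the count, mean, standard deviation, minimum, quartiles, and maximum; for string data it reports the count, the number of unique values, the most frequent value, and that value's frequency instead.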

In [3]:
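import pandas as pd
import numpy as np
# Note: the cells below assume `reviews` is the wine reviews DataFrame,
# already loaded (for example with pd.read_csv).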

In [None]:
reviews.points.describe()
#output - (for NUMERICAL DATA)
#count    129971.000000
#mean         88.447138
#             ...
#75%          91.000000
#max         100.000000
#Name: points, Length: 8, dtype: float64


reviews.taster_name.describe()
#output - (for STRING DATA)
#count         103727
#unique            19
#top       Roger Voss
#freq           25514
#Name: taster_name, dtype: object

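To make the type-aware behaviour concrete, here is a minimal sketch on a small, hypothetical DataFrame (`toy` and its values are made up for illustration and are not the wine reviews data). It calls `describe()` once on a numeric column and once on a string column so the two summary shapes can be compared.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews DataFrame.
toy = pd.DataFrame({
    "points": [87, 90, 90, 95],
    "taster_name": ["Roger Voss", "Roger Voss", "Kerin O'Keefe", None],
})

# Numeric column: count, mean, std, min, quartiles, max.
numeric_summary = toy.points.describe()

# String column: count (non-missing), unique, top, freq.
string_summary = toy.taster_name.describe()

print(numeric_summary)
print(string_summary)
```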
**`mean` method**

To see the mean of the points allotted (e.g. how well an averagely rated wine does), we can use the `mean()` method:

In [None]:
reviews.points.mean()

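One detail worth knowing when reading a mean: by default `mean()` skips missing values. The sketch below uses a hypothetical Series (`points` here is made-up data, not the real column) to show the default and the `skipna=False` alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value.
points = pd.Series([80, 90, np.nan, 100])

# Default behaviour: the NaN is ignored, so this averages 80, 90 and 100.
mean_points = points.mean()

# With skipna=False the missing value propagates and the result is NaN.
mean_with_nan = points.mean(skipna=False)

print(mean_points, mean_with_nan)
```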

**`unique` method**

To see a list of unique values we can use the unique() function:

In [None]:
reviews.taster_name.unique()
#output - array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt','Alexander Peartree', 'Michael Schachner',...])

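As a small illustration on hypothetical data (the `tasters` Series below is invented for the example), `unique()` returns each distinct value once, in order of first appearance, and keeps a missing value as its own entry. `nunique()` is shown alongside it because it counts distinct values and excludes missing ones by default.

```python
import pandas as pd

# Hypothetical taster names with one repeat and one missing entry.
tasters = pd.Series(["Roger Voss", "Kerin O'Keefe", "Roger Voss", None])

# Distinct values in order of first appearance; the missing entry is included.
unique_tasters = tasters.unique()

# Number of distinct non-missing values.
n_unique = tasters.nunique()

print(unique_tasters)
print(n_unique)
```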
**`value_counts` method** 

To see a list of unique values and how often they occur in the dataset, we can use the `value_counts()` method:

In [None]:
reviews.taster_name.value_counts()
#output - Roger Voss           25514
#Michael Schachner    15134
#                     ...
#Fiona Adams             27
#Christina Pickard        6
#Name: taster_name, Length: 19, dtype: int64

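The following sketch, again on a hypothetical `tasters` Series rather than the real data, shows three common ways of calling `value_counts()`: the default (sorted by count, missing values dropped), `dropna=False` to count missing values too, and `normalize=True` to get proportions instead of counts.

```python
import pandas as pd

# Hypothetical taster names with one repeat and one missing entry.
tasters = pd.Series(["Roger Voss", "Roger Voss", "Kerin O'Keefe", None])

# Default: most frequent first, missing values excluded.
counts = tasters.value_counts()

# Include the missing entry as its own row.
counts_with_missing = tasters.value_counts(dropna=False)

# Proportions of the non-missing values instead of raw counts.
shares = tasters.value_counts(normalize=True)

print(counts)
print(counts_with_missing)
print(shares)
```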
## Maps

A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often.

1) `map()` is the first, and slightly simpler one.

In [None]:
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)

# The function you pass to map() should expect a single value from the Series (a point value, in the above example), and return a transformed version of that value. map() returns a new Series where all the values have been transformed by your function.

# apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')


#Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of reviews, we can see that it still has its original points value.

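The behaviour described above can be checked on a small scale. This is a minimal sketch on a hypothetical three-row DataFrame (`toy` is made up for illustration); the `row.copy()` call is a defensive choice in this sketch, so the function never touches the object pandas hands it.

```python
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame.
toy = pd.DataFrame({
    "points": [80, 90, 100],
    "country": ["Italy", "France", "US"],
})
toy_mean = toy.points.mean()

# map(): the function receives one value at a time and returns a new Series.
centered = toy.points.map(lambda p: p - toy_mean)

# apply() with axis="columns": the function receives one row at a time.
def remean_points(row):
    row = row.copy()  # work on a copy instead of mutating the input row
    row["points"] = row["points"] - toy_mean
    return row

remeaned = toy.apply(remean_points, axis="columns")

# The original DataFrame still holds its original points values.
print(toy.points.tolist())
print(centered.tolist())
print(remeaned)
```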

In [None]:
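# Pandas also provides many common mapping operations as built-in operators, which are faster than map() or apply().
# For example, reviews.points - review_points_mean re-centers the points column directly.
# Pandas will also understand what to do if we perform these operations between Series of equal length.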

reviews.country + " - " + reviews.region_1

#output- 
#0            Italy - Etna
#1                     NaN
#               ...       
#129969    France - Alsace
#129970    France - Alsace
#Length: 129971, dtype: object

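Operators applied to Series work element by element, and a missing value on either side gives a missing result, which is why row 1 in the output above is `NaN`. A minimal sketch on hypothetical data (the `toy` DataFrame is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame; one region is missing.
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "France"],
    "region_1": ["Etna", np.nan, "Alsace"],
    "points": [87, 87, 90],
})

# Series minus a scalar: the scalar is subtracted from every element.
centered = toy.points - toy.points.mean()

# Series combined with Series: values are paired up row by row.
labels = toy.country + " - " + toy.region_1

print(centered.tolist())
print(labels)
```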
In [None]:
# question - 1
#What is the median of the points column in the reviews DataFrame?
median_points = reviews.points.median()

# question - 2
#What countries are represented in the dataset? (Your answer should not include any duplicates.)
countries = reviews.country.unique()

# question - 3
#How often does each country appear in the dataset? Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.
reviews_per_country = reviews.country.value_counts()

# question - 4
#Create variable centered_price containing a version of the price column with the mean price subtracted.
centered_price = reviews.price - reviews.price.mean()

# question - 5
#I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']

# question - 6
#There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset.
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

# question - 7
#We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.
#Create a series star_ratings with the number of stars corresponding to each review in the dataset.
def stars(row):
    # 95 or higher: 3 stars; 85 to 94: 2 stars; anything else: 1 star.
    if row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(stars, axis='columns')

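A quick way to sanity-check the thresholds stated in question 7 is to run them on boundary values. The sketch below is an alternative to the row-wise `apply` in the cell above, not the notebook's own answer: it uses `map()` on the points values alone, and `stars_from_points` and `toy_points` are hypothetical names introduced for the example.

```python
import pandas as pd

def stars_from_points(points):
    """Translate a score into stars using the thresholds from question 7."""
    if points >= 95:
        return 3
    elif points >= 85:
        return 2
    return 1

# Hypothetical scores chosen to sit on either side of each threshold.
toy_points = pd.Series([80, 84, 85, 94, 95, 100])

# map() is enough here because the rule depends on a single column.
toy_stars = toy_points.map(stars_from_points)

print(toy_stars.tolist())
```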