# Python for Data Analytics 1
## Assignment 1 - Arturo Ford Sosa

**Setting up**

In [1]:
# Open the JSON file and load its contents into the variable `data`
import json

with open('wine.json', 'r') as file:
    data = json.load(file)

**Question 1:** 

_How many wine reviews are included in the dataset?_

In [2]:
# `data` is a list with one dictionary per review, so its length is the number of reviews
print(f"There are {len(data)} reviews included in the dataset.")

There are 129971 reviews included in the dataset.


**Question 2:** 

_What's the length of the last review?_

In [3]:
# This function measures each review's length in words: split() breaks the description on whitespace
# and len() counts the resulting pieces. The count is stored in the review's dictionary under the key "length".

def add_length(data: list) -> list:
    for rev in data:
        rev["length"] = len(rev["description"].split())
    return data

data = add_length(data)

print(f"The length of the last review is {data[-1]['length']} words.")

The length of the last review is 27 words.


**Question 3:** 

_How many different countries have wines reviewed in the dataset?_

In [4]:
# This function collects every country in the dataset into a set, which keeps only unique values.
# Afterwards None is filtered out, so reviews without a country do not count as a country.
# Returns a set with all countries found.

def create_country_list(data: list) -> set:
    country_list = set()
    for rev in data:
        country_list.add(rev['country'])
    country_list = set(filter(None, country_list))
    return country_list
    
print(f"{len(create_country_list(data))} different countries have wines reviewed in the dataset.")

43 different countries have wines reviewed in the dataset.


**Question 4:** 

_Build a dictionary with the following structure:_

`{country: number of wines reviewed coming from that country}`


In [5]:
# This function takes the review data and the set of countries to evaluate as inputs,
# then counts the reviews for each country and returns the counts as a dictionary.

def wine_reviews_by_country(data: list, country_list: set) -> dict:
    country_revs = {}
    for country in country_list:
        amount = 0
        for rev in data:
            if rev['country'] == country:
                amount += 1
        country_revs[country] = amount
    return country_revs

wine_reviews_by_country(data, create_country_list(data))

{'Turkey': 90,
 'Cyprus': 11,
 'India': 9,
 'Brazil': 52,
 'New Zealand': 1419,
 'Chile': 4472,
 'Morocco': 28,
 'Croatia': 73,
 'Ukraine': 14,
 'Portugal': 5691,
 'Lebanon': 35,
 'Mexico': 70,
 'Serbia': 12,
 'Egypt': 1,
 'France': 22093,
 'Peru': 16,
 'Canada': 257,
 'Italy': 19540,
 'England': 74,
 'Australia': 2329,
 'Bosnia and Herzegovina': 2,
 'Germany': 2165,
 'Uruguay': 109,
 'Slovenia': 87,
 'Austria': 3345,
 'Macedonia': 12,
 'US': 54504,
 'Romania': 120,
 'Spain': 6645,
 'Israel': 505,
 'Greece': 466,
 'Georgia': 86,
 'Moldova': 59,
 'Luxembourg': 6,
 'Czech Republic': 12,
 'South Africa': 1401,
 'Argentina': 3800,
 'Switzerland': 7,
 'Armenia': 2,
 'Hungary': 146,
 'Slovakia': 1,
 'China': 1,
 'Bulgaria': 141}
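
**Note on Question 4:** The function above scans the whole dataset once per country. As a possible alternative, the sketch below counts all countries in a single pass with `collections.Counter` from the standard library. The function name `count_reviews_by_country` and the sample records are made up for illustration and are not part of the assignment data.

```python
from collections import Counter

def count_reviews_by_country(reviews: list) -> dict:
    """Count reviews per country in one pass, skipping missing countries."""
    counts = Counter(
        rev["country"] for rev in reviews if rev.get("country")
    )
    return dict(counts)

# Hypothetical sample records, not taken from wine.json
sample = [
    {"country": "France"},
    {"country": "Italy"},
    {"country": "France"},
    {"country": None},
]
result = count_reviews_by_country(sample)
print(result)
```

The truthiness check mirrors the `filter(None, ...)` call used in Question 3, so reviews without a country are left out in both versions.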

**Question 5:** 

_Build a dictionary with the following structure:_

`{country: average points of wines coming from that country}`


In [6]:
# This function takes the review data and the set of countries to evaluate,
# then calculates the average points of the wines from each country (rounded to 2 decimals) and returns a dictionary.

def average_points_by_country(data: list, country_list: set) -> dict:
    country_avg_pts = {}
    for country in country_list:
        sum_points = 0
        country_hits = 0
        for rev in data:
            if rev['country'] == country:
                sum_points += int(rev['points'])
                country_hits += 1  
        avg_points = 0
        if country_hits != 0:
            avg_points = round(sum_points/country_hits,2)
        country_avg_pts[country] = avg_points
    return country_avg_pts

average_points_by_country(data, create_country_list(data))

{'Turkey': 88.09,
 'Cyprus': 87.18,
 'India': 90.22,
 'Brazil': 84.67,
 'New Zealand': 88.3,
 'Chile': 86.49,
 'Morocco': 88.57,
 'Croatia': 87.22,
 'Ukraine': 84.07,
 'Portugal': 88.25,
 'Lebanon': 87.69,
 'Mexico': 85.26,
 'Serbia': 87.5,
 'Egypt': 84.0,
 'France': 88.85,
 'Peru': 83.56,
 'Canada': 89.37,
 'Italy': 88.56,
 'England': 91.58,
 'Australia': 88.58,
 'Bosnia and Herzegovina': 86.5,
 'Germany': 89.85,
 'Uruguay': 86.75,
 'Slovenia': 88.07,
 'Austria': 90.1,
 'Macedonia': 86.83,
 'US': 88.56,
 'Romania': 86.4,
 'Spain': 87.29,
 'Israel': 88.47,
 'Greece': 87.28,
 'Georgia': 87.69,
 'Moldova': 87.2,
 'Luxembourg': 88.67,
 'Czech Republic': 87.25,
 'South Africa': 88.06,
 'Argentina': 86.71,
 'Switzerland': 88.57,
 'Armenia': 87.5,
 'Hungary': 89.19,
 'Slovakia': 87.0,
 'China': 89.0,
 'Bulgaria': 87.94}
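
**Note on Question 5:** The averages can probably also be computed in one pass by keeping a running total and a running count per country. This is only a sketch: `average_points_single_pass` and the sample records are hypothetical, and it assumes, as in the cell above, that `points` can be converted with `int()`.

```python
from collections import defaultdict

def average_points_single_pass(reviews: list) -> dict:
    """Average points per country, reading the dataset only once."""
    totals = defaultdict(int)  # sum of points per country
    counts = defaultdict(int)  # number of reviews per country
    for rev in reviews:
        country = rev.get("country")
        if not country:
            continue  # skip reviews without a country
        totals[country] += int(rev["points"])
        counts[country] += 1
    return {c: round(totals[c] / counts[c], 2) for c in totals}

# Hypothetical sample records, not taken from wine.json
sample = [
    {"country": "France", "points": "90"},
    {"country": "France", "points": "87"},
    {"country": "Italy", "points": "88"},
    {"country": None, "points": "85"},
]
averages = average_points_single_pass(sample)
print(averages)
```

Because a country only enters `totals` once it has at least one review, the division never sees a zero count.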

**Question 6:** 

_What's the country that produces the wines with the highest average rating?_

In [7]:
# This code reuses the averages computed by the previously defined functions
# to find the country with the highest average rating and display it.

country_avg_pts = average_points_by_country(data, create_country_list(data))
highest_rating = max(country_avg_pts.values())
highest_rating_country = max(country_avg_pts, key=country_avg_pts.get)
print(f"The country that produces the wines with the highest average rating is {highest_rating_country}, with a score of {highest_rating}.")

The country that produces the wines with the highest average rating is England, with a score of 91.58.
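
**Note on Question 6:** `max()` returns only the first key it finds with the highest value, so a tie between countries would go unnoticed. A minimal sketch of how ties could be listed is shown below; `top_rated_countries` and the sample averages are hypothetical and only illustrate the idea.

```python
def top_rated_countries(avg_by_country: dict) -> list:
    """Return every country whose average equals the maximum."""
    if not avg_by_country:
        return []
    best = max(avg_by_country.values())
    return sorted(c for c, avg in avg_by_country.items() if avg == best)

# Hypothetical averages with a deliberate tie
sample_averages = {"A": 90.0, "B": 91.5, "C": 91.5}
top = top_rated_countries(sample_averages)
print(top)
```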


**Question 7:** 

_Update each wine's description by adding at the end of each description the following piece of text:_

`"This is a {designation} from {country} that scored {points} points"`

_What is the resulting description of the last review?_


In [8]:
# This function receives the wine dataset and appends a one-sentence summary of the review to the end of each description.
# It also handles reviews where designation, country or points is None by leaving that part out of the sentence.

def add_generic_description(data:list) -> list:
    data_w_generic = []
    for rev in data:
        designation = ''
        country = ''
        points = ''
        # The first 'if' below is NOT the best way to check whether a review has already been processed,
        # because a description could contain the words 'that scored' on its own.
        # However, I can't find a check for this within the scope of what we have studied in class.
        # An alternative would probably be another key-value pair that marks a review as processed, but the question does not ask for it.
        if 'that scored' in rev['description']:
            data_w_generic.append(rev)
            continue
        if rev['designation'] is None and rev['country'] is None and rev['points'] is None:
            data_w_generic.append(rev)
            continue
        if rev['designation'] is not None:
            designation = f'This is a {rev["designation"]} '
        else:
            designation = 'This is a wine '
        if rev['country'] is not None:
            country = f'from {rev["country"]} '
        if rev['points'] is not None:
            points = f'that scored {rev["points"]} points'
        # rstrip() removes the trailing space left behind when country or points is missing
        rev['description'] = (rev['description'] + ' ' + designation + country + points).rstrip() + '.'
        data_w_generic.append(rev)
    return data_w_generic

print(f'The resulting description of the last review is: \n{add_generic_description(data)[-1]["description"]}')

The resulting description of the last review is: 
Big, rich and off-dry, this is powered by intense spiciness and rounded texture. Lychees dominate the fruit profile, giving an opulent feel to the aftertaste. Drink now. This is a Lieu-dit Harth Cuvée Caroline from France that scored 90 points.
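
**Note on Question 7:** The comments in the cell above mention marking processed reviews with an extra key-value pair instead of searching the text. A simplified sketch of that idea follows. The key name `summary_added`, the function `add_summary_once` and the sample review are all assumptions made for illustration, and the sketch does not reproduce every branch of the original function.

```python
def add_summary_once(reviews: list) -> list:
    """Append the summary sentence only to reviews not yet marked as processed."""
    for rev in reviews:
        if rev.get("summary_added"):
            continue  # already processed, leave the description alone
        parts = [f"This is a {rev.get('designation') or 'wine'}"]
        if rev.get("country") is not None:
            parts.append(f"from {rev['country']}")
        if rev.get("points") is not None:
            parts.append(f"that scored {rev['points']} points")
        summary = " ".join(parts)
        rev["description"] = f"{rev['description']} {summary}."
        rev["summary_added"] = True  # hypothetical marker key
    return reviews

# Hypothetical sample review, not taken from wine.json
sample = [{
    "description": "Crisp and dry.",
    "designation": "Reserve",
    "country": "Spain",
    "points": "88",
}]
add_summary_once(sample)
add_summary_once(sample)  # second call changes nothing
print(sample[0]["description"])
```

With the marker in place, running the cell several times should not append the sentence more than once, even if a description happens to contain the words "that scored".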


**Question 8:** 

_What's the proportion of wine tasters that have a Twitter account?_


In [9]:
# This function receives the wine dataset as a list and reports what proportion of tasters have a Twitter account.
# It de-duplicates (taster_name, taster_twitter_handle) pairs, so a handle shared by several tasters is counted once per taster.

def twitter_tasters_proportion(data:list) -> str:
    tasters_w_handles = []
    tasters = 0
    handles = 0
    for rev in data:
        if rev['taster_name'] is not None or rev['taster_twitter_handle'] is not None:
            tasters_w_handles.append([rev['taster_name'], rev['taster_twitter_handle']])
    tasters_w_handles = list(set(tuple(i) for i in tasters_w_handles))
    for row in tasters_w_handles:
        if row[0] is not None:
            tasters += 1
        if row[1] is not None:
            handles += 1
    proportion = round((handles/tasters)*100,2)
    return f"The proportion of wine tasters that have a Twitter account is {proportion}%, given that there are {handles} accounts and {tasters} tasters."

print(twitter_tasters_proportion(data))

The proportion of wine tasters that have a Twitter account is 84.21%, given that there are 16 accounts and 19 tasters.
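
**Note on Question 8:** The same proportion could probably be obtained with two sets, one for all taster names and one for the names that appear with a handle. The sketch below assumes that a taster is identified by `taster_name`. The function `twitter_proportion` and the sample records are hypothetical.

```python
def twitter_proportion(reviews: list) -> float:
    """Percentage of distinct tasters that have a Twitter handle."""
    tasters = set()      # every distinct taster name
    with_handle = set()  # taster names seen with a handle
    for rev in reviews:
        name = rev.get("taster_name")
        if name is None:
            continue  # a review without a taster cannot be attributed
        tasters.add(name)
        if rev.get("taster_twitter_handle") is not None:
            with_handle.add(name)
    if not tasters:
        return 0.0  # avoid dividing by zero on an empty dataset
    return round(len(with_handle) / len(tasters) * 100, 2)

# Hypothetical sample records, not taken from wine.json
sample = [
    {"taster_name": "Ana", "taster_twitter_handle": "@ana"},
    {"taster_name": "Ana", "taster_twitter_handle": "@ana"},
    {"taster_name": "Ben", "taster_twitter_handle": None},
    {"taster_name": "Cy", "taster_twitter_handle": "@shared"},
    {"taster_name": "Di", "taster_twitter_handle": "@shared"},
    {"taster_name": None, "taster_twitter_handle": None},
]
proportion = twitter_proportion(sample)
print(proportion)
```

Counting names rather than handles means two tasters who share one account are both treated as having Twitter, which matches the per-taster counting used in the cell above.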
