# Basic Python Practices

#### Exercise 00. Variable Declaration

Define the following variables with values of your choice:

- A variable that stores your name (text).
- A variable that stores your age (integer).
- A variable that indicates if you like programming (true or false).
- A variable that stores your average grade (decimal number).

In [None]:
# Declare your variables here.

#greeting = "Hello world"
STUDENT_NAME = "N. Winocur"
student_age_fictional = 700
LIKES_PROGRAMMING = True
your_average_grade = 9000.1

- Create a list with your five favorite numbers and print it.

In [None]:
# Favorite numbers list
import math
favored_numbers_list = [0, 1, math.pi, 69, 1337]
print(favored_numbers_list)

- Create a dictionary that stores a student's information and print it:

    - Name
    - Age
    - Final grade

In [None]:
student_information = {"Name": STUDENT_NAME, "Age": student_age_fictional, "Final grade": your_average_grade}
print(student_information)

#### Exercise 01. Basic data analysis with native Python structures.  
Create a list with the grades of 5 students: [8.5, 9.2, 7.8, 8.9, 10].

- Calculate the average of the grades.

In [None]:
import numpy as np
student_grades = [8.5, 9.2, 7.8, 8.9, 10]
their_average = np.mean(student_grades)
print(their_average)

- Find the highest and lowest grade.

In [None]:
highest_grade = np.max(student_grades)
lowest_grade = np.min(student_grades)
print(highest_grade)
print(lowest_grade)
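
Since this exercise asks for native Python structures, the same results can also be obtained without NumPy. A minimal sketch using only built-ins (the `native_*` names are illustrative and not part of the original notebook):

```python
# Grades from the exercise statement
student_grades = [8.5, 9.2, 7.8, 8.9, 10]

# Average with built-ins only: total divided by the number of items
native_average = sum(student_grades) / len(student_grades)

# max() and min() work directly on lists
native_highest = max(student_grades)
native_lowest = min(student_grades)

print(f"Average: {native_average:.2f}")
print(f"Highest: {native_highest}, lowest: {native_lowest}")
```

For larger datasets NumPy is usually the better choice, but for five values the built-ins are enough.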

# Real estate data cleaning with Pandas for efficient analysis

This is a real dataset that was downloaded using web scraping techniques. The data contains records from **Fotocasa**, one of the most popular real estate websites in Spain. Please do not perform web scraping unless it is for academic purposes.

The dataset was downloaded a few years ago by Henry Navarro, and no economic benefit was obtained from it.

It contains thousands of real house listings published on the website www.fotocasa.com. Your goal is to extract as much information as possible with the data science knowledge you have acquired so far.

Let's get started!

- First, let's read and explore the dataset.

In [None]:
import pandas as pd

# Read the CSV file
ds = pd.read_csv('assets/real_estate.csv', sep=';') # This CSV file uses semicolons instead of commas as separators
ds # display the whole DataFrame
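
The effect of `sep=';'` can be checked without the real file. The sketch below parses a small in-memory sample. The rows and values are invented for illustration; only the column names `address`, `price` and `surface` mirror ones used later in this notebook.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the semicolon-separated layout (values are made up)
sample_csv = io.StringIO(
    "address;price;surface\n"
    "Calle A, 1;250000;80\n"
    "Calle B, 2;310000;\n"
)

# With the right separator, each field becomes its own column,
# even though the addresses themselves contain commas
sample = pd.read_csv(sample_csv, sep=";")
print(sample.shape)
print(sample.columns.tolist())

# An empty field is read as NaN
print(sample["surface"].isna().sum())
```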

- Display the first rows of the CSV file.

In [None]:
print(ds.head(10)) #default shows first five rows; showing ten confirms we have at least one NaN value

Perfect, that was a short warm-up. Now let's begin with the real exercises!

#### Exercise 01. What is the most expensive house in the entire dataset? (★☆☆)

Print the address and price of the selected house. For example:

`The house located at Calle del Prado, Nº20 is the most expensive, and its price is 5000000 USD.`

In [None]:
#print(ds.columns)
expensive_house_price = ds.price.max()
expensive_house_id = ds.price.idxmax()
expensive_house_address = ds.address[expensive_house_id]

print(f"The house located at {expensive_house_address} is the most expensive, and its price is {expensive_house_price} USD.")
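
The lookup pattern used here, `idxmax()` to find the row label and then indexing by that label, can be tried on a toy DataFrame. This is only a sketch: the rows below are invented, and `.loc` is used to fetch the whole row at once instead of one column at a time.

```python
import pandas as pd

# Invented listings, only to illustrate the pattern
toy = pd.DataFrame(
    {
        "address": ["Calle A, 1", "Calle B, 2", "Calle C, 3"],
        "price": [250000, 900000, 310000],
    }
)

# idxmax() returns the index label of the row holding the maximum price
top_label = toy["price"].idxmax()

# .loc with that label returns the full row, so address and price stay in sync
top_row = toy.loc[top_label]
print(f"The house located at {top_row['address']} is the most expensive, "
      f"and its price is {top_row['price']} USD.")
```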

#### Exercise 02. What is the cheapest house in the dataset? (★☆☆)

This exercise is similar to the previous one, except now we are looking for the house with the lowest price. Remember to print the address and price of the selected house. For example:

`The house located at Calle Alcalá, Nº58 is the cheapest, and its price is 12000 USD.`


In [None]:
inexpensive_house_price = ds.price[ds.price>0].min()
inexpensive_house_id = ds.price[ds.price>0].idxmin()
inexpensive_house_address = ds.address[inexpensive_house_id]

print(f"The house located at {inexpensive_house_address} is the cheapest, and its price is {inexpensive_house_price} USD.")

#### Exercise 03. What is the largest and smallest house in the dataset? (★☆☆)

Print the address and area of the selected houses. For example:

`The largest house is located at Calle Gran Vía, Nº38, and its area is 5000 square meters.`

`The smallest house is located at Calle Mayor, Nº12, and its area is 200 square meters.`

This exercise is similar to the previous one, but we are looking for the largest and smallest houses based on their area.

In [None]:
largest_house_area = ds.surface.max()
largest_house_id = ds.surface.idxmax()
largest_house_address = ds.address[largest_house_id]

print(f"The largest house is located at {largest_house_address}, and its area is {largest_house_area} square meters.")


smallest_house_area = ds.surface[ds.surface>0].min()
smallest_house_id = ds.surface[ds.surface>0].idxmin()
smallest_house_address = ds.address[smallest_house_id]

print(f"The smallest house is located at {smallest_house_address}, and its area is {smallest_house_area} square meters.")
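
Filtering out zeroes before calling `min()` matters because a listing with an unknown area may be recorded as 0. A small sketch of that idea as a reusable function follows. `smallest_positive` is a hypothetical helper name, and the data is invented.

```python
import pandas as pd


def smallest_positive(frame, column):
    """Return the row with the smallest strictly positive value in `column`."""
    # The comparison is False for both zeroes and NaN, so both are excluded
    positive = frame.loc[frame[column] > 0, column]
    return frame.loc[positive.idxmin()]


# Invented listings: one has area 0 and one has no area at all
toy = pd.DataFrame(
    {
        "address": ["Calle A, 1", "Calle B, 2", "Calle C, 3", "Calle D, 4"],
        "surface": [0.0, 45.0, None, 120.0],
    }
)

row = smallest_positive(toy, "surface")
print(f"The smallest house is located at {row['address']}, "
      f"and its area is {row['surface']} square meters.")
```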

#### Exercise 04. How many unique populations are in the dataset? (★☆☆)

Count the number of unique populations in the 'level5' column and print the names of the populations separated by commas. For example:

`> print(populations)`

`population1, population2, population3, ...`

In [None]:
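# unique() keeps the order of first appearance; dropna() leaves out missing populations
populations = ds["level5"].dropna().unique()
print(f"There are {len(populations)} unique populations in the dataset")

# join() separates the names with commas without leaving a trailing separator
print(", ".join(map(str, populations)))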
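
As a quick check of how missing values affect the count, the sketch below uses an invented `level5` column with placeholder names. By default both `groupby` and `nunique()` ignore NaN, while `unique()` keeps it unless it is dropped first.

```python
import pandas as pd

# Placeholder population names, with one missing value
toy = pd.DataFrame({"level5": ["population1", "population2", None, "population1"]})

n_groups = toy.groupby("level5").ngroups   # NaN keys are dropped by default
n_unique = toy["level5"].nunique()         # NaN is not counted by default
names = toy["level5"].dropna().unique()    # drop NaN explicitly before listing

print(n_groups, n_unique)
print(", ".join(names))
```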

#### Exercise 05. Does the dataset contain null values (NAs)? (★☆☆)

Print a boolean (`True` or `False`) to check if there are null values, followed by the columns that contain NAs.

In [None]:
# Columns that contain at least one NA
columns_containing_na = ds.columns[ds.isna().any()].tolist()
is_na_found = len(columns_containing_na) > 0

# The "level#" columns look useful, but columns like "level#Id" contain nothing but zeroes, so they are as uninformative as NAs.
columns_only_zeroes = [column_name for column_name, series in ds.items() if (series == 0).all()]

print(f"NAs found? {is_na_found}")
print(columns_containing_na)
print(f"Columns containing only zeroes: {columns_only_zeroes}")
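
The difference between a column that contains at least one NA and a column that is entirely NA is easy to see on a toy example. The column names and values below are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "price": [100.0, np.nan, 300.0],          # one missing value
        "empty_col": [np.nan, np.nan, np.nan],    # entirely missing
        "rooms": [2, 3, 1],                       # complete
    }
)

has_any_na = toy.isna().any()   # per column: at least one NA
is_all_na = toy.isna().all()    # per column: nothing but NAs

print(bool(has_any_na.any()))              # does the DataFrame contain any NA at all?
print(toy.columns[has_any_na].tolist())    # columns with at least one NA
print(toy.columns[is_all_na].tolist())     # columns that are entirely NA
```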


#### Exercise 06. Remove the null values (NAs) from the dataset, if applicable (★★☆)

After removing the null values, compare the size of the DataFrame before and after the removal.

In [None]:
size_before_removing_nulls = ds.size
print(f"Size before removing nulls was\t{size_before_removing_nulls} (shape {ds.shape})")

# Drop the columns that contain nothing but NAs
ds_without_nas = ds.dropna(axis=1, how="all")
print(f"Size after removing nulls is\t{ds_without_nas.size} (shape {ds_without_nas.shape})")

# Next, also drop the useless columns where every value is zero or NA.
# A boolean mask is used so that legitimate zeroes in the remaining columns are not turned into NAs.
only_zeroes_or_nas = (ds.isna() | (ds == 0)).all(axis=0)
ds_cleaned_up = ds.loc[:, ~only_zeroes_or_nas]
print(f"Size after removing both nulls and columns containing exclusively zeroes is\t{ds_cleaned_up.size} (shape {ds_cleaned_up.shape})")
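
When comparing a DataFrame before and after a removal, `size` counts cells while `shape` reports rows and columns separately, which makes it easier to see what was dropped. The following sketch, on invented data, contrasts dropping all-NA columns, dropping columns that are all zero or NA, and dropping rows that still contain an NA.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "price": [100.0, np.nan, 300.0],
        "empty_col": [np.nan, np.nan, np.nan],
        "zero_col": [0, 0, 0],
    }
)

# Drop columns that are entirely NA
without_empty_cols = toy.dropna(axis=1, how="all")

# Drop columns where every value is either NA or zero, leaving other cells untouched
useless = (toy.isna() | (toy == 0)).all(axis=0)
without_useless_cols = toy.loc[:, ~useless]

# Drop rows that still contain an NA
complete_rows = without_useless_cols.dropna(axis=0, how="any")

print(toy.shape, without_empty_cols.shape, without_useless_cols.shape, complete_rows.shape)
```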

#### Exercise 07. What is the average price in the population of "Arroyomolinos (Madrid)"? (★★☆)

Filter the dataset by the 'level5' column and print the average of the 'price' column for that population.

In [None]:
level_five_location = "Arroyomolinos (Madrid)"
real_estate_there = ds_cleaned_up[ds_cleaned_up['level5'] == level_five_location]

print(f"Average price in {level_five_location} is\t{real_estate_there.price.mean()}")
print(f"Ignoring prices listed as zero (just in case), that average would be\t{real_estate_there[real_estate_there['price'] > 0].price.mean()}")
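
To see why the two averages can differ, consider a toy example with one listing priced at zero. The prices below are invented, and "Other" is a placeholder population.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "level5": ["Arroyomolinos (Madrid)", "Arroyomolinos (Madrid)", "Other", "Arroyomolinos (Madrid)"],
        "price": [200000, 0, 150000, 300000],
    }
)

# Boolean mask selecting the rows of one population
mask = toy["level5"] == "Arroyomolinos (Madrid)"

# Average over every selected row, zero included
mean_all = toy.loc[mask, "price"].mean()

# Average over the selected rows with a strictly positive price
mean_positive = toy.loc[mask & (toy["price"] > 0), "price"].mean()

print(f"{mean_all:.2f} vs {mean_positive:.2f}")
```

A zero price pulls the first average down, so reporting both values makes the effect of such placeholder entries visible.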


#### Exercise 08. Plot the histogram of prices for the population of "Arroyomolinos (Madrid)" and explain what you observe (★★☆)

Plot the histogram of the prices and write a brief analysis of the plot in a Markdown cell.

In [None]:
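import matplotlib.pyplot as plt

# Plot the histogram of prices (missing prices are dropped so the automatic binning cannot fail)
plt.figure(figsize=(10, 5))
plt.hist(real_estate_there["price"].dropna(), bins="auto")
plt.title(f"Real estate prices in {level_five_location}")
plt.xlabel("Price")
plt.ylabel("Number of listings")
plt.show()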

##### Analysis of this plot:
- The majority of listings were in the $250k to $375k range.
- More affordable homes were still reasonably common: there were about as many homes priced between $150k and $200k as between $200k and $250k.
- There were a little more than half as many listings between $375k and $425k as there were under $200k.
- Homes priced over $425k were much less common, perhaps even fewer in total than those priced between $375k and $425k.
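
Observations like these are easier to verify when the bin counts are also available as numbers. A minimal sketch, assuming invented prices rather than the real Arroyomolinos data: `numpy.histogram` accepts the same `bins="auto"` rule that `plt.hist` was given above.

```python
import numpy as np

# Invented prices, only to illustrate the idea
prices = np.array([150_000, 180_000, 260_000, 270_000, 300_000, 350_000, 410_000, 500_000])

# Compute the histogram without drawing it: counts per bin and the bin edges
counts, edges = np.histogram(prices, bins="auto")

# Print one line per bin with its price range and number of listings
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:>10,.0f} to {right:>10,.0f}: {count} listings")
```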