# Project Learn

*   Organize information using basic Python data structures
*   Import data from a CSV file and clean it using the `pandas` library
*   Create data visualizations: scatter plots and box plots
*   Use correlation to examine the relationship between two variables









# Appending to Lists





*   Appending Items

    *   add an item to a list that already exists using the `append` method.


In [None]:
price_usd.append(599)
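
As a minimal sketch of the `append` call above: the starting values in `price_usd` below are hypothetical, since the notebook's real list is not shown here. `append` changes the list in place and returns `None`, so there is no need to assign its result.

```python
# Hypothetical starting prices; the notebook's real `price_usd` list is not shown.
price_usd = [115910.26, 48718.17, 28977.56]

# `append` adds one item to the end of the existing list, in place.
price_usd.append(599)

print(price_usd)       # the new item is the last element
print(len(price_usd))  # the list grew by one
```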

# Zipping Items


    


   





*   To combine (zip) two lists together
     *   Use the built-in `zip` function
        *   create a new list that pairs the values in our `price_usd` list with the values in our `area_m2` list.
        *   `zip` returns an iterator, so wrap the result in `list` to get a list of tuples

In [None]:
new_list = zip(price_usd, area_m2)
zipped_list = list(new_list)
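
A small self-contained sketch of the zipping step, using made-up values for `price_usd` and `area_m2` (both lists are assumptions for illustration). Each pair comes out as a tuple, and `zip` stops at the end of the shorter list.

```python
# Hypothetical values; the notebook's real lists are not shown.
price_usd = [100000, 50000, 30000]
area_m2 = [128, 210, 58]

# `zip` pairs items by position; `list` turns the iterator into a list of tuples.
zipped_list = list(zip(price_usd, area_m2))
print(zipped_list)

# With lists of unequal length, the extra items of the longer list are ignored.
shorter = list(zip(price_usd, area_m2[:2]))
print(len(shorter))
```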

# Pandas Preview




* Import pandas library
* DataFrame
* Print object type
* Print shape

In [None]:
# Import pandas library, aliased as `pd`
import pandas as pd

# Declare variable `df_houses`
df_houses = pd.DataFrame(houses_columnwise)

# Print `df_houses` object type
print("df_houses type:", type(df_houses))

#df_houses type: <class 'pandas.core.frame.DataFrame'>

# Print `df_houses` shape
print("df_houses shape:", df_houses.shape)
#df_houses shape: (5, 4)


# Get output of `df_houses`
df_houses
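
The cell above relies on a `houses_columnwise` variable that is defined elsewhere. As a sketch, assuming it is a dictionary that maps column names to lists of equal length, something like the following would give the `(5, 4)` shape shown in the comments. All column names and values here are hypothetical.

```python
import pandas as pd

# Hypothetical column-wise data: each key becomes a column, each list its values.
houses_columnwise = {
    "price_usd": [115910.26, 48718.17, 28977.56, 36932.27, 83903.51],
    "area_m2": [128.0, 210.0, 58.0, 79.0, 111.0],
    "rooms": [4.0, 3.0, 2.0, 3.0, 3.0],
    "state": ["A", "B", "C", "A", "B"],
}

df_houses = pd.DataFrame(houses_columnwise)

print("df_houses type:", type(df_houses))
print("df_houses shape:", df_houses.shape)  # (rows, columns)
```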

# Importing CSV Files



* use `pd` as an *alias* for pandas
* use the `read_csv` function to create a DataFrame from a CSV file

In [None]:
import pandas as pd

df = pd.read_csv("data/colombia-real-estate-1.csv")
df.head()
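
To try `read_csv` without the data file, one option is to read CSV text from an in-memory buffer. The sketch below uses `io.StringIO` and invented rows; the column names are assumptions and may not match the real file.

```python
import io

import pandas as pd

# Invented CSV content standing in for a file on disk.
csv_text = """property_type,department,area_m2,price_usd
house,Santander,120,95000
apartment,Antioquia,85,60000
house,Cundinamarca,200,150000
"""

# `read_csv` accepts a file path or any file-like object.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.head())
print(df_demo.dtypes)  # pandas infers a dtype for each column
```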

# Inspecting a DataFrame using the shape, info and head in pandas


* use the `shape` attribute (no parentheses):
    * understand the dimensionality of the DataFrame as a `(rows, columns)` tuple
* use the `info` method:
    * summarizes the column names, non-null counts, data types, and memory usage
* use `print`:
    * see the rows in our new DataFrame (pandas truncates the display of large DataFrames)
* use the `head` method:
    * take a look at the first five rows

In [None]:
# understanding the dimensionality of the DataFrame
df.shape

# get a general idea of what the DataFrame contained
df.info()

# take a look at the first five rows
df.head()
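
The same inspection steps can be tried on a small hand-made DataFrame. This is only a sketch with invented values; one missing area is included so that the non-null counts reported by `info` differ between columns.

```python
import numpy as np
import pandas as pd

# Invented data with one missing value in `area_m2`.
df_demo = pd.DataFrame(
    {
        "department": ["Santander", "Antioquia", "Cundinamarca", "Santander"],
        "area_m2": [120.0, np.nan, 85.0, 200.0],
    }
)

# `shape` is an attribute that holds a (rows, columns) tuple.
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)

# `info` prints its summary directly and returns None.
df_demo.info()

# `head` takes an optional number of rows; the default is 5.
first_two = df_demo.head(2)
print(first_two)
```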

#  Clean DataFrame


* Drop rows with missing values from a DataFrame using pandas.
  * use the `dropna` method:
    * drops rows that contain at least one missing value
    * use `inplace=True`:
      * updates the original DataFrame and returns `None`.
    * use `inplace=False` (the default):
      * leaves the original DataFrame unchanged and returns a new DataFrame with the rows dropped.

In [None]:
print("df shape before dropping rows", df.shape)
df.dropna(inplace=True)
print("df shape after dropping rows", df.shape)
df.head()
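
The difference between the two `inplace` settings can be seen on a tiny example. The DataFrame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented data: two of the three rows contain a missing value.
df_demo = pd.DataFrame(
    {
        "area_m2": [120.0, np.nan, 85.0],
        "price_usd": [95000.0, 60000.0, np.nan],
    }
)

# Default (inplace=False): a new DataFrame is returned, the original is untouched.
cleaned = df_demo.dropna()
shape_original = df_demo.shape
shape_cleaned = cleaned.shape
print(shape_original, shape_cleaned)

# inplace=True: the original is modified and the call returns None.
result = df_demo.dropna(inplace=True)
print(result, df_demo.shape)
```

Because `inplace=True` returns `None`, assigning its result to a variable is a common source of bugs.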



*   Replace string characters and recast data types in a column using pandas.
    *   recasting:
        * call `astype` on the DataFrame: recast every column
        * call `astype` on a single column: recast only that column



In [None]:
# recast the whole DataFrame
# (`info` prints its own summary and returns None, so no `print` is needed)
df.info()
newdf = df.astype("str")
newdf.info()

# only recast individual columns
df["area_m2"] = df.area_m2.astype(int)
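
Replacing characters and recasting often go together: a price stored as text such as `"$95,000"` has to lose its symbols before it can become a number. The sketch below assumes that kind of formatting; the values are invented.

```python
import pandas as pd

# Invented data: prices stored as strings with "$" and "," characters.
df_demo = pd.DataFrame(
    {
        "price_usd": ["$95,000", "$60,500"],
        "area_m2": [120.0, 85.0],
    }
)

# Strip the characters with `.str.replace`, then recast the column to float.
df_demo["price_usd"] = (
    df_demo["price_usd"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Recast a single column from float to int.
df_demo["area_m2"] = df_demo["area_m2"].astype(int)

print(df_demo.dtypes)
```

Passing `regex=False` treats `"$"` as a literal character rather than a regular-expression anchor.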





*   Create new columns derived from existing columns in a DataFrame using pandas
*   Drop a column from a DataFrame using pandas:
    * use the `drop` method:
      * drop a column by setting the `axis` argument to `columns`
      * drop a row by passing its index label and setting the `axis` argument to `index`, for example row 2



In [None]:
# Create new columns derived from existing columns
df3["price_m2"] = df3["price_usd"] / df3["area_m2"]

# drop a column
# (without `inplace=True`, `drop` returns a new DataFrame; with it, `drop` returns None)
df2 = df.drop(columns=["price_mxn"])

# drop row 2
df2 = df.drop(2, axis="index")
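
A self-contained sketch of the derived column and both kinds of `drop`, on invented data. By default `drop` returns a new DataFrame and leaves the original as it was.

```python
import pandas as pd

# Invented data for illustration.
df_demo = pd.DataFrame(
    {
        "price_usd": [95000.0, 60000.0, 210000.0],
        "area_m2": [100, 80, 300],
        "price_mxn": [1900000.0, 1200000.0, 4200000.0],
    }
)

# New column computed element-wise from two existing columns.
df_demo["price_m2"] = df_demo["price_usd"] / df_demo["area_m2"]

# Drop a column; `df_demo` itself still has `price_mxn`.
no_mxn = df_demo.drop(columns=["price_mxn"])

# Drop the row whose index label is 2.
no_row_2 = df_demo.drop(2, axis="index")

print(no_mxn.columns.tolist())
print(no_row_2.index.tolist())
```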



*   Split the strings in one column to create another using pandas.
    *   use the `.str.split` method:
        * `expand=True`: tells pandas to return the split pieces as separate columns
          * assigning them to new column names creates the new columns without dropping any of the ones that already exist.



In [None]:
df3[["lat", "lon"]] = df3["lat-lon"].str.split(",", expand=True)
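
One detail worth checking after a split is the data type: the pieces are still strings. The sketch below uses invented coordinates and adds a cast to `float`, which is an extra step not shown in the cell above.

```python
import pandas as pd

# Invented coordinates stored as "lat,lon" strings.
df_demo = pd.DataFrame({"lat-lon": ["4.69,-74.05", "6.25,-75.57"]})

# expand=True returns one column per piece; assign them to two new columns.
df_demo[["lat", "lon"]] = df_demo["lat-lon"].str.split(",", expand=True)

# The new columns hold strings, so cast them before doing arithmetic.
df_demo[["lat", "lon"]] = df_demo[["lat", "lon"]].astype(float)

print(df_demo)
```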



*   Concatenate DataFrames
    *   use the `concat` function:
        * to put our DataFrames together, pass them to `pd.concat` in a list
      



In [None]:
# Concatenate df1, df2, and df3
df = pd.concat([df1,df2,df3])
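
When DataFrames are stacked this way, each one keeps its own index labels, so labels can repeat. The sketch below, on invented data, shows the optional `ignore_index=True` argument, which renumbers the rows instead.

```python
import pandas as pd

# Two invented DataFrames with the same columns.
df_a = pd.DataFrame({"area_m2": [100, 80]})
df_b = pd.DataFrame({"area_m2": [300, 55]})

# Default: original index labels are kept, so 0 and 1 appear twice.
stacked = pd.concat([df_a, df_b])
print(stacked.index.tolist())

# ignore_index=True builds a fresh 0..n-1 index.
renumbered = pd.concat([df_a, df_b], ignore_index=True)
print(renumbered.index.tolist())
```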



*   Saving a DataFrame as a CSV
    *   using the `to_csv` method:
        * setting the `index` argument to `False`:
            * the DataFrame index isn't included in the CSV file.



In [None]:
df.to_csv("data/small-df.csv", index=False)
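
The effect of `index=False` is easiest to see by comparing the two outputs. When no path is given, `to_csv` returns the CSV text as a string, which the sketch below uses to avoid writing a file. The data is invented.

```python
import pandas as pd

# Invented data for illustration.
df_demo = pd.DataFrame({"area_m2": [100, 80], "price_usd": [95000, 60000]})

# With no path, `to_csv` returns the CSV content as a string.
csv_with_index = df_demo.to_csv()
csv_no_index = df_demo.to_csv(index=False)

# With the index, the header starts with an empty name for the index column.
header_with_index = csv_with_index.splitlines()[0]
header_no_index = csv_no_index.splitlines()[0]
print(header_with_index)
print(header_no_index)
```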

# Exploratory Data Analysis



*   **Scatter Plot**
    *   uses dots to represent values for two different numeric variables.
    *   useful for finding correlations between variables
    *   use the `usecols` argument of `read_csv` to import only selected columns from the CSV file




In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px

df1 = pd.read_csv("data/colombia-real-estate-3.csv", usecols=["area_m2", "price_usd"])
plt.scatter(df1["area_m2"], df1["price_usd"], color="r")
plt.xlabel("Property Area")
plt.ylabel("Price in US Dollars")
plt.title("Property Area vs Price in US Dollars");



*   **Series**
    *   Aggregate data in a Series using `value_counts` in pandas



In [None]:
import pandas as pd
df2 = pd.read_csv("data/colombia-real-estate-2.csv")
df2["department"].value_counts()
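
A sketch of `value_counts` on a small invented Series. The result is itself a Series, sorted from the most to the least frequent value, and `normalize=True` returns shares instead of counts.

```python
import pandas as pd

# Invented department names.
departments = pd.Series(
    ["Santander", "Cundinamarca", "Santander", "Antioquia", "Santander", "Antioquia"],
    name="department",
)

# Count how often each value appears, most frequent first.
counts = departments.value_counts()
print(counts)

# Shares of the total instead of raw counts.
shares = departments.value_counts(normalize=True)
print(shares)
```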



*   **Histogram of Area**
    *   Display **frequency** distribution of numerical data
    *   useful for detecting outliers
    *   Create a histogram using Matplotlib



In [None]:
import matplotlib.pyplot as plt
import pandas as pd


df = pd.read_csv("data/colombia-real-estate-1.csv", usecols=["area_m2"])
plt.hist(df, bins=10, rwidth=0.9, color="b")
plt.title("The Area of Real Estate in Colombia")
plt.xlabel("Property Area")
plt.ylabel("Number of Properties")
plt.grid(axis="y", alpha=0.75);
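
To see the numbers behind a histogram without drawing it, NumPy's `histogram` function performs the same kind of equal-width binning. The areas below are invented, with one very large value to show how an outlier ends up alone in the last bin.

```python
import numpy as np

# Invented areas; 500 is far from the rest.
areas = [50, 60, 70, 80, 90, 100, 110, 500]

# Three equal-width bins between the minimum and the maximum.
counts, edges = np.histogram(areas, bins=3)

print(edges)   # bin boundaries
print(counts)  # number of values in each bin
```

An empty bin followed by a bin with a single value is a typical sign of an outlier.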



*   **Boxplot of Area**
    *   A boxplot summarizes a distribution with five values:
        * minimum
        * first quartile
        * median
        * third quartile
        * maximum
    *   Components of a Boxplot:
        * **Box**: The main part of the plot, which spans the interquartile range (IQR) from the first quartile to the third quartile, with a line at the median.
        * **Whiskers**:
          * Lines extending from the box to the smallest and largest values that lie within 1.5 times the IQR of the first and third quartiles, respectively.
        * **Outliers**:
          * Data points that fall outside the range of the whiskers.
          * These are often plotted as individual points.
    * The three quartiles divide the data into four sections called intervals.
        * Each interval contains 25% of the observations in the data set.
          *  the box created by the first and third quartiles represents the middle 50% of observations.
    * Part of the purpose of making a boxplot is to find these outliers so that you can decide whether to discard them in future analyses.



In [None]:
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data/colombia-real-estate-1.csv", usecols=["area_m2"])
plt.boxplot(df["area_m2"])
plt.ylabel("Area [sq. meters]")
plt.title("Area in Square Meters");
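
The quartiles and the 1.5 × IQR rule described above can also be computed directly with pandas. This is a sketch on invented areas; the variable names are illustrative, and the quartiles use pandas' default linear interpolation.

```python
import pandas as pd

# Invented areas; 500 is far from the rest.
areas = pd.Series([50, 60, 70, 80, 90, 100, 110, 500])

# First and third quartiles, and the interquartile range.
q1 = areas.quantile(0.25)
q3 = areas.quantile(0.75)
iqr = q3 - q1

# Points beyond these limits are drawn as individual outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = areas[(areas < lower_limit) | (areas > upper_limit)]
print(q1, q3, iqr)
print(lower_limit, upper_limit)
print(outliers.tolist())
```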



*   normal distribution (bell curve)
    * All other things being equal, we expect most outcomes to fall in the middle of the possible range, with a smaller number of outcomes on either side of the peak.

*   skewed distribution
    * the peak of the curve is shifted, or skewed, either to the right or the left of the distribution



# Group by State







*   Aggregate data using the `groupby` method in pandas
*   use the `get_group` method:
      * to see the properties in just one of the groups
*   use the `groupby` method:
    * create groups based on one category, or pass a list of column names to group by multiple categories
*   use the `.mean()` method:
    * find the average property area in each department







In [None]:
# use the groupby method:
df1 = pd.read_csv("data/colombia-real-estate-1.csv")
dept_group = df1.groupby("department")

# use the get_group method (reusing the same grouped object):
dept_group.get_group("Santander")

# use groupby to calculate aggregations:
mean_area_by_dept = dept_group["area_m2"].mean()
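
The grouping steps can be followed on a small invented DataFrame, including grouping by more than one column. All column names and values below are assumptions for illustration.

```python
import pandas as pd

# Invented data for illustration.
df_demo = pd.DataFrame(
    {
        "department": ["Santander", "Antioquia", "Santander", "Antioquia"],
        "property_type": ["house", "house", "apartment", "house"],
        "area_m2": [100, 200, 60, 120],
    }
)

grouped = df_demo.groupby("department")

# Rows that belong to one group only.
santander = grouped.get_group("Santander")

# Average area per department.
mean_area = grouped["area_m2"].mean()

# Grouping by a list of columns gives one result per combination.
mean_area_multi = df_demo.groupby(["department", "property_type"])["area_m2"].mean()

print(mean_area)
print(mean_area_multi)
```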


# Create a bar chart using pandas



*   shows the values of a categorical variable in the dataset
*   consists of an axis and a series of labeled horizontal or vertical bars
*   plots the frequency of different values of a variable, or simply the values themselves
*   the y-axis of a vertical bar chart, or the x-axis of a horizontal bar chart, is called the scale



In [None]:
# Create bar chart from `mean_price_by_state` using pandas

mean_price_by_state.plot(
    kind = "bar",
    xlabel="State",
    ylabel="Price [USD]",
    title= "Mean House Price by State"
)

# Correlation




*   the relationship between two sets of data
*   when we calculate the relationship, the result is a correlation coefficient
    * Any value between -1 and 1
    * Values **greater than 0** indicate a **positive correlation**
    * Values **below 0** indicate a **negative correlation**
    * The closer the coefficient value is to 1 or -1:
      * the stronger the correlation is;
    * the closer the coefficient value is to 0:
      * the weaker the relationship is.
*   use the `Series.corr` method to calculate the correlation coefficient




In [None]:
area_m2 = df2["area_m2"]
price_cop = df2["price_cop"]
correlation = area_m2.corr(price_cop)
print(correlation)
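
The two extremes of the coefficient can be reproduced with invented data in which price is an exact linear function of area. `Series.corr` computes the Pearson coefficient by default; real data will normally fall somewhere between these values.

```python
import pandas as pd

# Invented areas.
area = pd.Series([50, 80, 100, 150])

# Price rises exactly in step with area: coefficient of 1.
price_rising = area * 1000 + 5000
corr_positive = area.corr(price_rising)

# Price falls exactly in step with area: coefficient of -1.
price_falling = 200000 - area * 1000
corr_negative = area.corr(price_falling)

print(round(corr_positive, 6), round(corr_negative, 6))
```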

# Subsets




*   Subset a DataFrame with a mask using pandas
    *   Another way to create subsets from a larger dataset is through masking
        * to filter out the data you're not interested in so that you can focus on the data you are interested in
    *   a `mask` is a Series of Boolean values; indexing a DataFrame with it keeps the rows where the mask is `True`



In [None]:
import pandas as pd
df1 = pd.read_csv("data/colombia-real-estate-1.csv")
mask = df1["area_m2"] > 200
mask.head()

df1[mask].head()
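
Masks can also be combined. The sketch below, on invented data, builds two masks and joins them with `&` so that only rows satisfying both conditions remain. The column names are assumptions.

```python
import pandas as pd

# Invented data for illustration.
df_demo = pd.DataFrame(
    {
        "area_m2": [150, 250, 90, 310],
        "property_type": ["house", "house", "apartment", "apartment"],
    }
)

# Each comparison produces a Boolean Series with one value per row.
mask_area = df_demo["area_m2"] > 200
mask_house = df_demo["property_type"] == "house"

# Keep rows where the mask is True.
large = df_demo[mask_area]

# `&` keeps rows where both masks are True.
large_houses = df_demo[mask_area & mask_house]

print(large)
print(large_houses)
```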