<img src="nevin in suit pt 1.jpg" style="float: right; margin: 20px; height: 150px">

# AirBnb Analysis -> Capstone Project

_Author: Nevin Joshua Lyons_

---

>Basing my Capstone Project on the CRISP-DM model based on the Data Science Analytics Methodology. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. 


- **CR**oss
- **I**ndustry
- **S**tandard
- **P**rocess
- **D**ata
- **M**ining

## Requirements

1. An executive summary:
  - What is your goal?
  
The goal of the project is to conduct sentiment analysis on Airbnb guest reviews to identify factors influencing guest satisfaction and dissatisfaction. The objective is to provide actionable recommendations for listing improvement based on insights derived from the sentiment analysis

  - Where did you get your data?
  
Kaggle Source  

  - What are your metrics?
  
The primary metrics used for the sentiment analysis include sentiment polarity scores, which indicate the positivity or negativity of each review. Additionally, sentiment trends for different aspects of listings such as cleanliness, location, amenities, and host responsiveness are analyzed to identify areas for improvement.

  - What were your findings?
  
To be done soon

  - What risks/limitations/assumptions affect these findings?
  
**Data Quality**: The accuracy and reliability of sentiment analysis findings depend on the quality of the data, including the completeness and authenticity of guest reviews.

**Subjectivity**: Sentiment analysis is inherently subjective and may vary based on individual interpretation, cultural differences, and context. It's essential to acknowledge the inherent subjectivity in analyzing guest sentiment.

**Sample Bias**: The dataset may exhibit sample bias, with reviews being more likely to be submitted by guests with particularly positive or negative experiences. This bias could affect the generalizability of findings to all Airbnb guests.

**Assumptions**: The analysis assumes that guest reviews accurately reflect the overall guest experience and that addressing identified areas for improvement will lead to enhanced guest satisfaction and improved listing performance. However, there may be other factors influencing guest satisfaction that are not captured in the data.

2. Summarize your statistical analysis, including:
  - implementation
  - evaluation
  - inference
3. Clearly document and label each section of your notebook(s)
  - Logically organize your information in a persuasive, informative manner.
  - Include notebook headers and subheaders, as well as clearly formatted markdown for all written components.
  - Include graphs/plots/visualizations with clear labels.
  - Comment and explain the purpose of each major section/subsection of your code.
  - Document your code for your future self, as if another person needed to replicate your approach.
4. Clearly document all of your decision points in the relevant sections
  - How did you acquire your data?
  - How did you transform or engineer your data?  Why?
  - How did you select your model?
  - How did you optimize hyperparameters?
5. Host your notebook and any other materials in your own public Github Repository.
  - You repo should have README file that guides us through the repository and links to important files.
  - Include links and explanations to any outside libraries or source code used.
  - Host a copy of your dataset or include a link to a remotely hosted version.

**BONUS**

Create a medium post of at least 1,000 words summarizing your approach in a tutorial format and link to it in your notebook.  In your tutorial, address a slightly less technical audience. Think back to Day 1 of the program - how would you explain and walk through your capstone project to your earlier self?


Exploratory data analysis (EDA) is a crucial part of any data science project. EDA helps us discover interesting relationships in the data, detect outliers and errors, examine our own assumptions about the data, and prepare for modeling. During EDA we might discover that we need to clean our data more conscientiously, or that we have more missing data than we realized, or that there aren't many patterns in the data (indicating that modeling may be challenging.)

In this lab you'll bring in a natural language dataset and perform EDA. The dataset contains Facebook statuses taken from between 2009 and 2011 as well as personality test results associated with the users whose Facebook statuses are included.

This dataset uses results from the Big Five Personality Test, also referred to as the five-factor model, which measures a person's score on five dimensions of personality:
- **O**penness
- **C**onscientiousness
- **E**xtroversion
- **A**greeableness
- **N**euroticism

Notoriously, the political consulting group Cambridge Analytica claims to have predicted the personalities of Facebook users by using those users' data, with the goal of targeting them with political ads that would be particularly persuasive given their personality type. Cambridge Analytica claims to have considered 32 unique 'groups' in the following fashion:
- For each of the five OCEAN qualities, a user is categorized as either 'yes' or 'no'.
- This makes for 32 different potential combinations of qualities. ($2^5 = 32$).

Cambridge Analytica's methodology was then, roughly, the following:
- Gather a large amount of data from Facebook.
- Use this data to predict an individual's Big Five personality "grouping."
- Design political advertisements that would be particularly effective to that particular "grouping." (For example, are certain advertisements particularly effective toward people with specific personality traits?)

In this lab you will perform EDA to examine many relationships in the data.

Exploratory data analysis can be a non-linear process, and you're encouraged to explore questions that occur to you as you work through the notebook.

> **Content note**: This dataset contains real Facebook statuses scraped from 2009 to 2011, and some of the statuses contain language that is not safe for work, crude, or offensive. The full dataset is available as `mypersonality.csv`, and a sanitized version containing only statuses that passed an automated profanity check is available as `mypersonality_noprofanity.csv`. Please do not hesitate to use `mypersonality_noprofanity.csv` if you would prefer to. Please note that the automated profanity check is not foolproof. If you have any concerns about working with this dataset, please get in touch with your section lead.

---

### External resources

These resources are not required reading but may be of use or interest.

- [Python Graph Gallery](https://python-graph-gallery.com/)
- [Wikipedia page](https://en.wikipedia.org/wiki/Big_Five_personality_traits) on the Big Five test
- [A short (3-4 pages) academic paper](./celli-al_wcpr13.pdf) using the `MyPersonality` dataset to model personality

---


# Step 2: Cleaning the data Types of data need to clean:

- Missing Values
- Data Type Conversion
- Inconsistent Formatting
- Duplicate Data
- Inaccurate Data

## Load packages

In [4]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 400

---

## Load data

In [5]:
calendar = pd.read_csv('calendar.csv')
listings = pd.read_csv('listings.csv')
reviews = pd.read_csv('reviews.csv')


In [4]:
calendar.head(1)

Unnamed: 0,listing_id,date,available,price
0,241032,2016-01-04,t,$85.00


In [7]:
calendar.isnull().sum()

listing_id         0
date               0
available          0
price         459028
dtype: int64

In [5]:
listings.head(1)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,241032,https://www.airbnb.com/rooms/241032,20160104002432,2016-01-04,Stylish Queen Anne Apartment,,"Make your self at home in this charming one-bedroom apartment, centrally-located on the west side of Queen Anne hill. This elegantly-decorated, completely private apartment (bottom unit of a duplex) has an open floor plan, bamboo floors, a fully equipped kitchen, a TV, DVD player, basic cable, and a very cozy bedroom with a queen-size bed. The unit sleeps up to four (two in the bedroom and ...","Make your self at home in this charming one-bedroom apartment, centrally-located on the west side of Queen Anne hill. This elegantly-decorated, completely private apartment (bottom unit of a duplex) has an open floor plan, bamboo floors, a fully equipped kitchen, a TV, DVD player, basic cable, and a very cozy bedroom with a queen-size bed. The unit sleeps up to four (two in the bedroom and ...",none,,...,10.0,f,,WASHINGTON,f,moderate,f,f,2,4.07


In [8]:
listings.isnull().sum()

id                                    0
listing_url                           0
scrape_id                             0
last_scraped                          0
name                                  0
                                   ... 
cancellation_policy                   0
require_guest_profile_picture         0
require_guest_phone_verification      0
calculated_host_listings_count        0
reviews_per_month                   627
Length: 92, dtype: int64

In [6]:
reviews.head(1)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to everything!


In [10]:
reviews.isnull().sum()
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84849 entries, 0 to 84848
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   listing_id     84849 non-null  int64 
 1   id             84849 non-null  int64 
 2   date           84849 non-null  object
 3   reviewer_id    84849 non-null  int64 
 4   reviewer_name  84849 non-null  object
 5   comments       84831 non-null  object
dtypes: int64(3), object(3)
memory usage: 3.9+ MB


---

## EDA on Statuses

Before we even vectorize the text, we might look at the lengths and word counts in each Facebook status.

#### Create a new column called `status_length` that contains the length of each status:

In [12]:
import pandas as pd

# Assuming you have a DataFrame named 'statuses_df' containing the Facebook statuses
# and the column 'status' containing the text of each status
# You can create the 'status_length' column as follows:

# Calculate the length of each status and store it in a new column
df['status_length'] = df['STATUS'].apply(len)

# Display the DataFrame to verify the new column
df.head()

Unnamed: 0,#AUTHID,STATUS,sEXT,sNEU,sAGR,sCON,sOPN,cEXT,cNEU,cAGR,cCON,cOPN,DATE,status_length,status_word_count
0,b7b7764cfa1c523e4e93ab2a79a946c4,likes the sound of thunder.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/19/09 03:21 PM,27,27
1,b7b7764cfa1c523e4e93ab2a79a946c4,is so sleepy it's not even funny that's she can't get to sleep.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,07/02/09 08:41 AM,63,63
2,b7b7764cfa1c523e4e93ab2a79a946c4,"is sore and wants the knot of muscles at the base of her neck to stop hurting. On the other hand, YAY I'M IN ILLINOIS! <3",2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/15/09 01:15 PM,121,121
3,b7b7764cfa1c523e4e93ab2a79a946c4,likes how the day sounds in this new song.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/22/09 04:48 AM,42,42
4,b7b7764cfa1c523e4e93ab2a79a946c4,is home. <3,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,07/20/09 02:31 AM,11,11


In [2]:
import pandas as pd       
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re

In [7]:
sent_tokenize(reviews())

TypeError: 'DataFrame' object is not callable

#### Create a new column called `status_word_count` that contains the number of words in each status:

Note: You can evaluate this based off of how many strings are separated by whitespaces; you're not required to check that each set of characters set apart by whitespaces is a word in the dictionary.

In [11]:
df['status_word_count'] = df['STATUS'].apply(len)

# Display the DataFrame to verify the new column
print(df.head())

                            #AUTHID  \
0  b7b7764cfa1c523e4e93ab2a79a946c4   
1  b7b7764cfa1c523e4e93ab2a79a946c4   
2  b7b7764cfa1c523e4e93ab2a79a946c4   
3  b7b7764cfa1c523e4e93ab2a79a946c4   
4  b7b7764cfa1c523e4e93ab2a79a946c4   

                                                                                                                      STATUS  \
0                                                                                                likes the sound of thunder.   
1                                                            is so sleepy it's not even funny that's she can't get to sleep.   
2  is sore and wants the knot of muscles at the base of her neck to stop hurting. On the other hand, YAY I'M IN ILLINOIS! <3   
3                                                                                 likes how the day sounds in this new song.   
4                                                                                                                is home. <3 

### Longest and shortest statuses

Looking at individual observations can help us get a sense of what the dataset contains.

#### Show the five longest and five shortest statuses based off of `status_word_count`:

In [14]:
import pandas as pd

# Assuming your data is stored in a DataFrame named df
# Replace 'status_word_count' with the actual column name containing word counts
# Replace 'status' with the actual column name containing statuses

# Longest statuses
longest_statuses = df.nlargest(5, 'status_word_count')['STATUS']

# Shortest statuses
shortest_statuses = df.nsmallest(5, 'status_word_count')['STATUS']

print("Five longest statuses:")
print(longest_statuses)

print("\nFive shortest statuses:")
print(shortest_statuses)

Five longest statuses:
8026    Heh...:"God I wish that I could hide away//And find a wall to bang my brains//I'm living in a fantasy,//a nightmare dream...reality//People ride about all day//In metal boxes made away//I wish that they would drop the bomb//And kill these cunts//that don't belong! I hate people!//I hate the human race//I hate people!//I hate your ugly face//I hate people!//I hate your fucking mess//I hate peop...
7976    "I said he's a fairy I do suppose//flyin thru the air in pantyhose//he may be very sexy or even cute//but he looks like a sucka in a blue and red suit//I said you need a man who's got finesse//& his whole name across his chest//he may be able to fly all thru the night//but can he rock a party til the early light//he can't satisfy you with his little worm//but I can bust you out w my Super sper...
9651    - so, this morning *PROPNAME* gets up to play and he goes over to the carpet where the light was shining through the blinds casting shadows...he then pro

## What's the distribution of post lengths?

Use visuals to show the distributions of post lengths. Show both the distribution of word counts and the distribution of lengths based off character.

---

## EDA of Personality Scores

There are two sets of personality columns in the dataset: class and score. According to the attached paper, scores have been converted to categories based on whether a score for a user fell above or below the median.

### Plot the distributions of personality scores for all five score columns:

---

### How many unique users exist in the dataset?

This dataset has redacted original poster names, but each user is given an `#AUTHID`. How many unique users are there, and how many posts per user do we have?

## EDA on unique users

Because we have many posts per user for most users, doing EDA on the personality score columns might be misleading. If we have 200 Facebook statuses from one very high-conscientiousness user, a bar chart of how many `'cCON'` statuses are associated with `'y'` might be misleading. We'll have to be careful about labeling and titling any visualizations we make off of the dataset.

#### Create a new dataframe called `unique_users` that only contains the `#AUTHID`, personality score, and personality category columns:

If you do this correctly, it should have 250 rows and 11 columns.

(Hint: You can use the pandas [drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) method to make this easier. The only column you want to consider when deciding if a user is duplicated is the `#AUTHID` column.)

#### Plot the distribution of personality scores for `unique_users`:

Do the distributions look different? Here, each individual user will only be represented once.

#### Use the `.describe()` method on `unique_users`:

### Plots vs. Tables

Consider what different information is easily conveyed by the plots of scores, versus the table with summary statistics. Explain when you might present a distribution versus when you might present a table of summary statistics.

#### Other visualizations:

Create 1-2 additional visualizations related to the `unique_users` dataframe.

You might consider:
- Barcharts of users per category per trait
- A seaborn correlation heatmap
- A seaborn pairplot

---

## Exploring status length and word count based on personality

#### Using `groupby()`, find the mean status length and status word count for posts by users in the high and low categories of each of the Big 5 traits.

You'll need to use `groupby()` five separate times for this.

#### Choose one of the personality category columns (i.e. `cOPN`, `cCON`, `cEXT`, `cAGR`, or `cNEU`.) Use `sns.distplot()` to visualize the distribution of status word counts for posts by users who score both high (`y`) and low (`n`) in that personality category:

---

## EDA on Word Counts

### Vectorize the text

In order to perform EDA on word count data, we'll need to count-vectorize.

Create a dataframe that contains the count-vectorized text for each Facebook status in the dataset.

To do this, you might follow these steps:
- Instantiate a `CountVectorizer` object
- Fit the count vectorizer on the Facebook statuses
- Store the transformed data
- Convert to a dataframe and store
    - Don't forget that the transformed data will need to be 'densified'. The `toarray()` or `todense()` methods will allow this.
    - Don't forget that the `get_feature_names()` method on a fitted `CountVectorizer` object will bring you back the words learned from the dataset, which you can set as the `columns` argument when creating the dataframe.
    
It's up to you whether or not to keep stopwords in the dataset.

### Show the 15 most common words

### Show the 15 frequency of the most common words as a bar chart

**Hint**: You can do this in one line of code. [This webpage](https://dfrieds.com/data-visualizations/bar-plot-python-pandas.html) has an example.

### Investigating `propname`

The word `propname` shows up frequently in this dataset. Show the first 10 statuses in the dataset that contain `propname`:

#### Provide a short explanation of what you believe `propname` to be:

Hint: The attached PDF also contains an explanation.

## Most common words based on personality category

In order to do more targeted EDA, we'll need to be able to reference not only the dataframe of vectorized statuses, but also the personality scores from the original dataframe.

#### Create a new dataframe called `text_and_scores` that concatenates the count-vectorized statuses side-by-side with the original personality category columns:

#### Show the 25 most common words for statuses from high-cAGR users:

#### Show the 25 most common words for statuses from low-cAGR users:

### (BONUS) Most common bigrams:

This is a bonus section and not required.

Find the 10 most common bigrams in the dataset.

### (BONUS) Most common trigrams:

This is a bonus section and not required.

Find the 10 most common trigrams in the dataset.

---

## Choose your own adventure

By now you've looked at a lot of visualizations and frequency counts.

Come up with 2-3 questions about the data, and try to answer them using descriptive statistics (like counts, averages, etc.) or visualizations.

Some questions you might explore:
- Have numbers been redacted, or are phone numbers, house numbers, or zip codes anywhere in the dataset?
- `PROPNAME` has been used to redact personal names. Given that this data was scraped between 2009 and 2011, investigate if any public figures or famous people show up in the dataset, or their names have been redacted as well.
- Is count of uppercase letters vs. lowercase letters per status related to any personality category or personality score?
- Is _average_ word count per status related to any personality category or personality metric?
- Is punctuation use related to personality?

Or, of course, come up with your own questions to investigate!

The focus here is on "explore" -- you might not find anything of particular interest, but don't let that discourage you.

---

## Exploratory vs. Explanatory Data Analysis 

> **Exploratory analysis** is what you do to get familiar with the data. You may start out with a hypothesis or question, or you may just really be delving into the data to determine what might be interesting about it. Exploratory analysis is the process of turning over 100 rocks to find perhaps 1 or 2 precious gemstones.
>
> **Explanatory analysis** is what happens when you have something specific you want to show an audience - probably about those 1 or 2 precious gemstones. In my blogging and writing, I tend to focus mostly on this latter piece, explanatory analysis, when you've already gone through the exploratory analysis and from this have determined something specific you want to communicate to a given audience: in other words, when you want to tell a story with data.

- Cole Nussbaumer Knaflic, [exploratory vs. explanatory analysis](http://www.storytellingwithdata.com/blog/2014/04/exploratory-vs-explanatory-analysis)

### Choose one visual to explain:

Now that you've performed an exploratory data analysis, choose a visual (or 1-3 related visuals) to frame as _explanatory_. This can be a visual you created above, or you can create a new visual. For this visual, make sure the visuals are formatted clearly, and provide a one to two paragraph explanation/interpretation of the visual.