## Introduction
Xenophobia is the fear or hatred of that which is perceived to be foreign or strange. During COVID-19, many people faced xenophobia because of their appearance, nationality, and similar traits. It is important to identify xenophobic tweets early and prevent them from spreading further.

## Exploratory Analysis

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


This dataset contains a single CSV file and has no target variable. You are free to experiment in whatever way you think is suitable in order to identify xenophobia as early as possible.


In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


The next code cells define functions for plotting data.

In [None]:
# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]] # For displaying purposes, pick columns that have between 1 and 50 unique values
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow # integer ceiling division; plt.subplot requires integer grid sizes
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()


In [None]:
# Correlation matrix
def plotCorrelationMatrix(df, graphWidth):
    filename = df.dataframeName
    df = df.select_dtypes(include=[np.number]) # correlation is only defined for numeric columns
    df = df.dropna(axis='columns') # drop columns with NaN
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
        return
    corr = df.corr()
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum = 1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()


In [None]:
# Scatter and density plots
def plotScatterMatrix(df, plotSize, textSize):
    df = df.select_dtypes(include =[np.number]) # keep only numerical columns
    # Remove rows and columns that would lead to df being singular
    df = df.dropna(axis='columns')
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    columnNames = list(df)
    if len(columnNames) > 10: # reduce the number of columns for matrix inversion of kernel density plots
        columnNames = columnNames[:10]
    df = df[columnNames]
    ax = pd.plotting.scatter_matrix(df, alpha=0.75, figsize=[plotSize, plotSize], diagonal='kde')
    corrs = df.corr().values
    for i, j in zip(*np.triu_indices_from(ax, k = 1)):
        ax[i, j].annotate('Corr. coef = %.3f' % corrs[i, j], (0.8, 0.2), xycoords='axes fraction', ha='center', va='center', size=textSize)
    plt.suptitle('Scatter and Density Plot')
    plt.show()


Now you're ready to read in the data and use the plotting functions to visualize the data.

### Let's check 1st file: /kaggle/input/Xenophobia.csv

In [None]:
sample = 10000
# Xenophobia.csv may have more rows in reality, but we are only loading/previewing the first 10000 rows
df1 = pd.read_csv('/kaggle/input/Xenophobia.csv', delimiter=',',nrows=sample,encoding='latin1')
df1.dataframeName = 'Xenophobia.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')

Let's take a quick look at what the data looks like:

In [None]:
df1.head(5)

Distribution graphs (histogram/bar graph) of sampled columns:

In [None]:
plotPerColumnDistribution(df1, 10, 5)

Correlation matrix:

In [None]:
plotCorrelationMatrix(df1, 8)

Scatter and density plots:

In [None]:
plotScatterMatrix(df1, 18, 10)

## Identify Xenophobic Tweets using Key Phrases

In [None]:
# Read the whole dataset (no nrows limit this time)
df1 = pd.read_csv('/kaggle/input/Xenophobia.csv', delimiter=',', encoding='latin1')
df1.dataframeName = 'Xenophobia.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')

In [None]:
# Reference: https://medium.com/sculpt/xenophobic-tweets-78a9b316635
search_terms = ["we need to enforce our borders", "learn to speak english",
                "are terrorists", "where are you really from", "are spreading virus",
                "i’m not like other", "illegal alien", "illegal immigrant",
                "illegal immigration", "send her back", "send them back", "illegal criminal"]

In [None]:
def xenophobic_match(tweet):
    """
    Return True if the tweet contains any of the search terms (case-insensitive).
    """
    tweet = tweet.lower()  # lowercase once instead of once per search term
    return any(s in tweet for s in search_terms)
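
One of the search terms, "i’m not like other", is written with a typographic apostrophe (U+2019), so a plain substring test will not match tweets typed with the ASCII apostrophe. The sketch below shows one possible way to make the comparison insensitive to that difference. The helper names `normalize_quotes` and `phrase_in_tweet` are hypothetical and are not part of the original notebook.

```python
def normalize_quotes(text):
    """Map typographic single quotes to the ASCII apostrophe (hypothetical helper)."""
    return text.replace("\u2019", "'").replace("\u2018", "'")


def phrase_in_tweet(phrase, tweet):
    """Case-insensitive substring test after normalizing apostrophes on both sides."""
    return normalize_quotes(phrase).lower() in normalize_quotes(tweet).lower()


# The phrase uses U+2019, the tweet uses the ASCII apostrophe.
print(phrase_in_tweet("i\u2019m not like other", "I'm not like other people"))
```

Applying the same normalization to every search term and every tweet could change the number of matches, so any counts should be recomputed after such a change.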

In [None]:
df1['text'] = df1['text'].astype(str)  # missing values become the string 'nan', so .lower() is safe
df1['target'] = df1['text'].apply(xenophobic_match)
df1['target'].value_counts()
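
The row-wise match above can also be written as a single vectorized call, which may be faster on large files. This is a minimal sketch, assuming the phrases are combined into one regular expression. The `phrases` list is a shortened copy of the search terms and the `toy` frame is made-up illustration data, not rows from Xenophobia.csv.

```python
import re

import pandas as pd

# Shortened copy of the search terms, for illustration only.
phrases = ["illegal alien", "send them back", "learn to speak english"]

# Escape each phrase so regex metacharacters are matched literally, then join with "|".
pattern = "|".join(re.escape(p) for p in phrases)

# Made-up rows for illustration; the None mimics a missing tweet text.
toy = pd.DataFrame({"text": ["Send them back now", "Lovely weather today", None]})

# astype(str) mirrors the notebook's handling of missing text;
# case=False plays the role of calling .lower() on each tweet.
toy["target"] = toy["text"].astype(str).str.contains(pattern, case=False, regex=True)
print(toy["target"].value_counts())
```

For ordinary English text this should flag the same rows as the substring loop, but the two approaches are not guaranteed to agree on every Unicode edge case.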

## Conclusion
Using only a handful of key phrases that signal xenophobia, we were able to flag 831 tweets as xenophobic. There is considerable potential to identify more such tweets with a better-chosen set of phrases. Once tweets are categorized, the data can be further analyzed by location.
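
As a rough illustration of the location-based follow-up, the sketch below counts flagged tweets per location. It assumes the data has a column named `location`, which has not been verified against Xenophobia.csv, and it runs on made-up rows rather than the real dataset.

```python
import pandas as pd

# Made-up rows for illustration; the "location" column name is an assumption.
toy = pd.DataFrame({
    "location": ["A", "A", "B", None],
    "target": [True, False, True, True],
})

# Put missing locations in their own group so they are not silently dropped.
by_location = (
    toy.assign(location=toy["location"].fillna("unknown"))
       .groupby("location")["target"]
       .agg(flagged="sum", total="count")
)

# Share of tweets in each location that matched a key phrase.
by_location["share"] = by_location["flagged"] / by_location["total"]
print(by_location.sort_values("flagged", ascending=False))
```

Free-text location fields on tweets tend to be noisy, so some cleaning or grouping of location values would likely be needed before drawing conclusions.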