# Guessing the number: linear regression

Regression has a long history in statistics, where it has been used to build simple but effective linear models of economic, psychological, social, and political data. <br />

Regression is also a common tool in data science. Data science practitioners see linear regression as a simple, understandable, yet effective algorithm for estimations, and, in its logistic regression version, for classification as well. <br />

Linear regression is a statistical model that defines the relationship between a target variable and a set of predictive features. It does so using a formula: <br />

y = a + bx <br />

You can translate this formula into something readable and useful for many problems. For instance, if you are trying to guess your sales based on historical results and available data about advertising expenditures, the same preceding formula becomes <br />

sales = a + b * (advertising expenditure) <br />

You may already have encountered this formula during high school because it's also the formula of a line in a two-dimensional plane, which is made of an x axis and a y axis. <br />

If b is positive, y increases as x increases and decreases as x decreases; when b is negative, y behaves in the opposite manner. When the value of b is near zero, the effect of x on y is slight, but if the absolute value of b is large, the effect of changes in x on y is great. <br />

You can judge graphically how well a line fits the data by summing the squares of all the vertical distances between the data points and the regression line. This sum is the minimum possible when you calculate the regression line using an estimation method called ordinary least squares. The differences between the real y values and the regression line (the predicted y values) are called residuals (errors).
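To make ordinary least squares concrete, here is a minimal sketch in plain Python that fits y = a + bx to a handful of points and reports the residuals. The data values and the helper name `fit_line` are assumptions made up for illustration; they are not part of the Boston example.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]   # made-up predictor values
ys = [3.0, 5.0, 7.0, 10.0]  # made-up target values
a, b = fit_line(xs, ys)

# residuals: real y values minus the values predicted by the line
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sum_squared_residuals = sum(r ** 2 for r in residuals)
print(a, b, sum_squared_residuals)
```

Any other choice of a and b would give a larger sum of squared residuals for these points, which is what "least squares" refers to.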



## Using more variables

When using a single variable for predicting y, you use simple linear regression, but when working with many variables, you use multiple linear regression. When you have many variables, their scale isn't important in creating precise linear regression predictions. However, it is a good habit to standardize X because the scale of the variables is quite important for some variants of regression. <br />

Standardization of datasets is a common requirement for many machine learning estimators implemented in Scikit-learn: they might behave badly if the individual features do not look more or less like standard normally distributed data (Gaussian with zero mean and unit variance). <br />
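As a rough illustration of what standardizing a single feature means, the sketch below subtracts the mean and divides by the population standard deviation. The function name `standardize` and the sample values are assumptions for illustration; Scikit-learn's `scale` applies the same idea column by column.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

feature = [20.0, 35.0, 50.0, 65.0, 80.0]  # made-up feature values
z = standardize(feature)
print(z)
```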

The following example relies on the Boston dataset from Scikit-learn. It tries to guess Boston housing prices using a linear regression. The example also tries to determine which variables influence the result more, so it standardizes the predictors.

In [None]:
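# note: load_boston was removed in scikit-learn 1.2, so this cell assumes an older release
from sklearn.datasets import load_boston
from sklearn.preprocessing import scale
boston = load_boston()
# the last attribute listed in the description (MEDV) is the target
print(boston.DESCR)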

In [None]:
# standardize the predictors; the target is left on its original scale
X, y = scale(boston.data), boston.target
print(X.shape, y.shape)

In [None]:
from sklearn.linear_model import LinearRegression
# X was already standardized with scale(), so the normalize option
# (removed in scikit-learn 1.2) is not needed here
regression = LinearRegression()
regression.fit(X,y)

Now that the algorithm is fitted, you can use the score method to report the R2 measure, which typically ranges from 0 to 1 on the training data and points out how much better a particular regression model is at predicting y than a simple mean would be. You can also see R2 as being the quantity of target information explained by the model, so getting near 1 means being able to explain most of the y variable using the model.
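The comparison against a simple mean can be written out directly. The following sketch computes R2 as one minus the ratio between the model's squared errors and the squared errors of always predicting the mean; the function name `r_squared` and the numbers are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 10.0]        # made-up targets, mean is 6.25
mean_only = [6.25] * len(y_true)      # always predicting the mean
perfect = list(y_true)                # predicting every value exactly
reversed_guess = [10.0, 7.0, 5.0, 3.0]  # predictions worse than the mean

print(r_squared(y_true, mean_only))
print(r_squared(y_true, perfect))
print(r_squared(y_true, reversed_guess))
```

The last call shows that the measure can drop below 0 when predictions are worse than the mean, which can happen on data the model was not trained on.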

In [None]:
print(regression.score(X,y))

Calculating R2 on the same set of data used for the training is common in statistics. In data science and machine learning, it's always better to test scores on data that has not been used for training. Algorithms of greater complexity can memorize the data better than they learn from it, but this can sometimes also be true for simpler models, such as linear regression.
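One simple way to keep some data aside is to shuffle the row indices and split them. The sketch below does this in plain Python; the function name `train_test_indices`, the 25 percent test fraction, and the row count are assumptions for illustration, and Scikit-learn ships its own helpers for the same job.

```python
import random

def train_test_indices(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # fixed seed for repeatability
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_indices(n_rows=100)
print(len(train_idx), len(test_idx))
```

The model would then be fitted on the rows in `train_idx` and scored only on the rows in `test_idx`.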

To understand what drives the estimates in the multiple regression model, you have to look at the coef_ attribute, which is an array containing the regression beta (b) coefficients. Printing the boston.feature_names attribute at the same time helps you understand which variable each coefficient references. The zip function generates an iterable that pairs both attributes, and you can print it for reporting.

In [None]:
print([a + ':' + str(round(b, 1)) for a, b in zip(boston.feature_names, regression.coef_)])

As you can see, DIS is among the variables with the strongest influence. DIS is the weighted distance to five employment centers. In real estate, a house that is too far from places that matter to people (such as work) loses value.

In [None]:
print(boston.DESCR)

## Understanding limitations and potential problems


Although linear regression is a simple yet effective estimation tool, it has quite a few problems. The problems can reduce the benefit of using linear regression in some cases, but it really depends on the data. <br />

- Linear regression can model only quantitative data. When modeling categories as the response, you need to switch to logistic regression.
- If data is missing and you don't deal with it properly, the model stops working.
- Outliers are quite disruptive for a linear regression because linear regression tries to minimize the squared value of the residuals, and outliers have big residuals.
- The relation between the target and each particular variable is based on a single coefficient. There isn't an automatic way to represent complex relations such as a parabola (where a unique value of x maximizes y) or exponential growth.
- The greatest limitation is that linear regression provides a summation of terms, which can vary independently of each other. It is hard to figure out how to represent the effect of certain variables that affect the result in very different ways according to their value.

# Moving to Logistic Regression

Linear regression is well suited for estimating values, but it isn't the best tool for predicting the class of an observation. In spite of the statistical theory that advises against it, you can try to classify a binary class by scoring one class as 1 and the other as 0. The results are disappointing most of the time, so the statistical theory wasn't wrong! <br /><br />
The fact is that linear regression works on a continuum of numeric estimates. In order to classify correctly, you need a more suitable measure, such as the probability of class membership. Thanks to the following formula, you can transform a linear regression numeric estimate into a probability that is more apt to describe how a class fits an observation:   <br /><br />
probability of a class = exp(r) / (1 + exp(r)) <br /><br />
r is the regression result (a + bx), and the coefficients are chosen to maximize the probability of the classes observed in the data we have. A linear regression using such a formula for transforming its results into probabilities is a logistic regression.<br />
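The transformation can be tried out in a few lines of Python. This is a minimal sketch of the formula above; the function name `class_probability` is an assumption, and the two branches are algebraically the same expression, arranged so that very large or very small values of r do not overflow.

```python
import math

def class_probability(r):
    """Map a regression result r = a + b*x to a probability between 0 and 1."""
    if r >= 0:
        # same value as exp(r) / (1 + exp(r)), written to avoid overflow
        return 1.0 / (1.0 + math.exp(-r))
    e = math.exp(r)
    return e / (1.0 + e)

for r in (-4.0, 0.0, 4.0):
    print(r, class_probability(r))
```

A regression result of 0 maps to a probability of 0.5, and results far from 0 map to probabilities close to 0 or 1.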

https://datajobs.com/data-science-repo/Logistic-Regression-%5BPeng-et-al%5D.pdf

## Applying logistic regression

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
# the measurements recorded for each iris flower
print(iris.data)
# the iris type (class) that each row of measurements belongs to
print(iris.target)
# X is all the rows of data except the last one, which is kept for testing
# y is all the targets except the last one, which is kept for testing
X, y = iris.data[:-1,:], iris.target[:-1]

In [None]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic.fit(X,y)
# test the model on the last row, which was kept out of training;
# the [-1:,:] slice keeps the row two-dimensional, as predict expects
print('Predicted class %s, real class %s' % (logistic.predict(iris.data[-1:,:]), iris.target[-1]))
print('Probabilities for each class from 0 to 2: %s' % (logistic.predict_proba(iris.data[-1:,:]),))

Using probabilities lets you guess the most probable class, but you can also order the predictions with respect to being part of that class. This is especially useful for medical purposes. Ranking a prediction in terms of likelihood with respect to others can reveal which patients are at most risk of getting or already having a disease.
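Ranking by probability amounts to a sort. The sketch below orders a few hypothetical patients by predicted probability; the patient names and probability values are made up for illustration and do not come from any model in this notebook.

```python
# hypothetical patient ids with a model's predicted probability of disease
predicted_risk = {"patient_a": 0.12, "patient_b": 0.81, "patient_c": 0.47}

# sort from the highest to the lowest predicted probability
ranking = sorted(predicted_risk.items(), key=lambda item: item[1], reverse=True)
for patient, probability in ranking:
    print(patient, probability)
```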

## Considering when classes are more than two

Most algorithms provided by Scikit-learn that predict probabilities or a score for a class can automatically handle multiclass problems using two different strategies: <br/> 

<b>One versus rest:</b> The algorithm compares every class with all the remaining classes, building one model for each class.<br /><br />
<b>One versus one:</b> The algorithm compares every class against every individual remaining class, building a number of models equal to n * (n - 1) / 2, where n is the number of classes.<br />
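To see how quickly the two strategies diverge, this short sketch counts the models each one needs. The function name `count_models` is an assumption for illustration; the pairs are enumerated explicitly to match the n * (n - 1) / 2 formula.

```python
from itertools import combinations

def count_models(n_classes):
    """Return (one-versus-rest count, one-versus-one count)."""
    one_vs_rest = n_classes                           # one model per class
    pairs = list(combinations(range(n_classes), 2))   # every pair of classes
    return one_vs_rest, len(pairs)

print(count_models(3))    # three classes
print(count_models(10))   # ten classes, such as the digits 0 to 9
```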

In the case of logistic regression, the default multiclass strategy in older Scikit-learn releases is one versus rest; recent releases default to a multinomial formulation instead.

The following example uses the digits dataset: http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html

The dataset contains handwritten digits, each stored as an 8 by 8 pixel image, for a total of 1797 images. Each pixel takes a value from 0 to 16, depending on its shade between completely white and completely black. We want to predict the digit shown in new images.


In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
# 1797 rows, each with 64 columns (8 * 8 pixels)
print(digits.data)
# the targets are the digits from 0 to 9
print(digits.target)
# use the first 1700 rows as the training set
X, y = digits.data[:1700,:], digits.target[:1700]
# use the remaining 97 rows for testing
tX, ty = digits.data[1700:,:], digits.target[1700:]

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
logistic = LogisticRegression()
# wrap the same base model in each multiclass strategy
OVR = OneVsRestClassifier(logistic).fit(X,y)
OVO = OneVsOneClassifier(logistic).fit(X,y)
print('One vs rest accuracy: %.3f' % OVR.score(tX,ty))
print('One vs one accuracy: %.3f' % OVO.score(tX,ty))

In [None]:
# plain logistic regression, relying on its default multiclass handling
LR = LogisticRegression()
LR.fit(X,y)
print('Default logistic regression accuracy: %.3f' % LR.score(tX,ty))

Interestingly, the one-versus-one strategy obtained the best accuracy thanks to its high number of models in competition.

# Making Things as Simple as Naïve Bayes

<br /><b>First, Conditional Probability & Bayes' Rule</b><br /><br />
Before someone can understand and appreciate the nuances of Naive Bayes, they need to know a couple of related concepts first, namely, the idea of Conditional Probability, and Bayes' Rule. 

Conditional Probability in plain English: What is the probability that something will happen, given that something else has already happened?

Let's say that there is some Outcome O and some Evidence E. From the way these probabilities are defined, the Probability of having both the Outcome O and Evidence E is: (Probability of O occurring) multiplied by the (Probability of E given that O happened)

One Example to understand Conditional Probability:

Let's say we have a collection of US Senators. Senators could be Democrats or Republicans. They are also either male or female.

If we select one senator completely randomly, what is the probability that this person is a female Democrat? Conditional Probability can help us answer that.

Probability of (Democrat and Female Senator) = Prob(Senator is Democrat) multiplied by the Conditional Probability of being Female given that they are a Democrat.

  P(Democrat & Female) = P(Democrat) x P(Female | Democrat) 

We could compute the exact same thing, the reverse way:

  P(Democrat & Female) = P(Female) x P(Democrat | Female) 
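A quick check with numbers shows that both routes give the same joint probability. The counts below are hypothetical, chosen only to make the arithmetic easy; they are not real Senate figures.

```python
from fractions import Fraction

# hypothetical counts, chosen only to illustrate the arithmetic
total = 100
democrats = 50
females = 20
female_democrats = 15

p_democrat = Fraction(democrats, total)
p_female = Fraction(females, total)
p_female_given_democrat = Fraction(female_democrats, democrats)
p_democrat_given_female = Fraction(female_democrats, females)

# the joint probability computed in both directions
one_way = p_democrat * p_female_given_democrat
other_way = p_female * p_democrat_given_female
print(one_way, other_way)
```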


<br /><b>Understanding Bayes Rule</b><br /><br />

Conceptually, this is a way to go from P(Evidence | Known Outcome) to P(Outcome | Known Evidence). Often, we know how frequently some particular evidence is observed, given a known outcome. We have to use this known fact to compute the reverse, that is, the chance of that outcome happening, given the evidence.

P(Outcome given that we know some Evidence) = P(Evidence given that we know the Outcome) times P(Outcome), divided by P(Evidence)

The classic example to understand Bayes' Rule:

$$P(\text{Disease} \mid \text{Test positive}) = \frac{P(\text{Test positive} \mid \text{Disease}) \times P(\text{Disease})}{P(\text{Test positive, with or without the disease})}$$
     
     
or simply:

     P(A|B) = P(B|A) * P(A) / P(B)
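The disease example can be worked through numerically. In this sketch the prevalence, sensitivity, and false positive rate are hypothetical values picked for illustration, and the function name `posterior` is likewise an assumption.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """P(Disease | Positive) from Bayes' rule."""
    # P(Positive) adds up both ways of testing positive
    p_positive = (p_pos_given_disease * prior
                  + p_pos_given_healthy * (1 - prior))
    return p_pos_given_disease * prior / p_positive

# hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positives
p = posterior(prior=0.01, p_pos_given_disease=0.95, p_pos_given_healthy=0.05)
print(round(p, 3))
```

Even with a fairly accurate test, the low base rate keeps the probability of disease given a positive result well below one half in this scenario.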

Now, all this was just preamble, to get to Naive Bayes.
	
<br /><b>Getting to Naive Bayes'</b><br /><br />

So far, we have talked only about one piece of evidence. In reality, we have to predict an outcome given multiple pieces of evidence. In that case, the math gets very complicated. To get around that complication, one approach is to 'uncouple' multiple pieces of evidence, and to treat each piece of evidence as independent. This approach is why this is called naive Bayes.

$$P(\text{Outcome} \mid \text{Multiple Evidence}) = \frac{P(\text{Evidence}_1 \mid \text{Outcome}) \times P(\text{Evidence}_2 \mid \text{Outcome}) \times \dots \times P(\text{Evidence}_N \mid \text{Outcome}) \times P(\text{Outcome})}{P(\text{Multiple Evidence})}$$

Many people choose to remember this as:

$$P(\text{Outcome} \mid \text{Evidence}) = \frac{P(\text{Likelihood of Evidence}) \times \text{Prior probability of Outcome}}{P(\text{Evidence})}$$

Notice a few things about this equation:

- If P(evidence | outcome) is 1, then we are just multiplying by 1.
- If P(some particular evidence | outcome) is 0, then the whole probability becomes 0. If we see contradicting evidence, we can rule out that outcome.
- Since we divide every outcome's score by the same P(Evidence), we can even get away without calculating it when we only need to compare outcomes.
- The intuition behind multiplying by the prior is that we give high probability to more common outcomes, and low probability to unlikely outcomes. These are also called base rates and they are a way to scale our predicted probabilities.

<br /><b>How to Apply NaiveBayes to Predict an Outcome?</b><br /><br />

Just run the formula above for each possible outcome. Since we are trying to classify, each outcome is called a class and it has a class label. Our job is to look at the evidence, to consider how likely it is to be this class or that class, and assign a label to each entity. Again, we take a very simple approach: The class that has the highest probability is declared the "winner" and that class label gets assigned to that combination of evidence.

<br /><b>Fruit Example</b><br /><br />

Let's try it out on a 'fruit' identification example to increase our understanding.

Let's say that we have data on 1000 pieces of fruit. They happen to be Banana, Orange or some Other Fruit. We know 3 characteristics about each fruit:

- Whether it is Long
- Whether it is Sweet
- Whether its color is Yellow

This is our 'training set.' We will use this to predict the type of any new fruit we encounter.

| Type | Long | Not Long | Sweet | Not Sweet | Yellow | Not Yellow | Total |
|---|---|---|---|---|---|---|---|
| Banana | 400 | 100 | 350 | 150 | 450 | 50 | 500 |
| Orange | 0 | 300 | 150 | 150 | 300 | 0 | 300 |
| Other Fruit | 100 | 100 | 150 | 50 | 50 | 150 | 200 |
| Total | 500 | 500 | 650 | 350 | 800 | 200 | 1000 |

We can pre-compute a lot of things about our fruit collection.

The so-called "Prior" probabilities. (If we didn't know any of the fruit attributes, this would be our guess.) These are our base rates.

- P(Banana) = 0.5 (500/1000)
- P(Orange) = 0.3
- P(Other Fruit) = 0.2

Probability of "Evidence"

- P(Long) = 0.5
- P(Sweet) = 0.65
- P(Yellow) = 0.8

Probability of "Likelihood"

- P(Long | Banana) = 0.8
- P(Long | Orange) = 0 [Oranges are never long in all the fruit we have seen.]
- ....
- P(Yellow | Other Fruit) = 50/200 = 0.25
- P(Not Yellow | Other Fruit) = 0.75

<br /><b>Given a Fruit, how to classify it?</b><br /><br />

Let's say that we are given the properties of an unknown fruit, and asked to classify it. We are told that the fruit is Long, Sweet and Yellow. Is it a Banana? Is it an Orange? Or Is it some Other Fruit?

We can simply run the numbers for each of the 3 outcomes, one by one. Then we choose the highest probability and 'classify' our unknown fruit as belonging to the class that had the highest probability based on our prior evidence (our 1000 fruit training set):

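$$P(\text{Banana} \mid \text{Long, Sweet, Yellow}) = \frac{P(\text{Long} \mid \text{Banana}) \, P(\text{Sweet} \mid \text{Banana}) \, P(\text{Yellow} \mid \text{Banana}) \, P(\text{Banana})}{P(\text{Long}) \, P(\text{Sweet}) \, P(\text{Yellow})} = \frac{0.8 \times 0.7 \times 0.9 \times 0.5}{P(\text{evidence})} = \frac{0.252}{P(\text{evidence})}$$

$$P(\text{Orange} \mid \text{Long, Sweet, Yellow}) = 0$$

$$P(\text{Other Fruit} \mid \text{Long, Sweet, Yellow}) = \frac{P(\text{Long} \mid \text{Other Fruit}) \, P(\text{Sweet} \mid \text{Other Fruit}) \, P(\text{Yellow} \mid \text{Other Fruit}) \, P(\text{Other Fruit})}{P(\text{evidence})} = \frac{100/200 \times 150/200 \times 50/200 \times 200/1000}{P(\text{evidence})} = \frac{0.01875}{P(\text{evidence})}$$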

By an overwhelming margin (0.252 >> 0.01875), we classify this Sweet/Long/Yellow fruit as likely to be a Banana.
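The same calculation can be reproduced from the fruit counts in the training table. This is a minimal sketch: the dictionary layout and the function name `score` are assumptions for illustration, and the common divisor P(evidence) is left out because it is identical for every fruit.

```python
# counts taken from the fruit table (only the "positive" columns are needed)
counts = {
    "Banana":      {"Long": 400, "Sweet": 350, "Yellow": 450, "Total": 500},
    "Orange":      {"Long": 0,   "Sweet": 150, "Yellow": 300, "Total": 300},
    "Other Fruit": {"Long": 100, "Sweet": 150, "Yellow": 50,  "Total": 200},
}
n_fruit = sum(row["Total"] for row in counts.values())

def score(fruit, evidence):
    """Prior times the product of the per-feature likelihoods."""
    row = counts[fruit]
    result = row["Total"] / n_fruit             # prior, such as P(Banana)
    for feature in evidence:
        result *= row[feature] / row["Total"]   # such as P(Long | Banana)
    return result

evidence = ["Long", "Sweet", "Yellow"]
scores = {fruit: score(fruit, evidence) for fruit in counts}
best = max(scores, key=scores.get)              # class with the highest score
print(scores, best)
```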


<br /><b>Why is it called Naive Bayes?</b><br /><br />

If you look at the right-hand side of the equation in the example above:

P(Long, Sweet, Yellow | Banana) x P(Banana) -->  P(Long | Banana) P(Sweet | Banana) P(Yellow | Banana) x P(Banana)

This means the Naive Bayes assumption is that those probabilities are independent of each other. This is not always the case, but this simplification makes Naive Bayes a practical tool.

It is okay to use every available feature for prediction, even though it seems as though it shouldn't be, given the strong association between variables. Here are some of the ways in which you commonly see Naive Bayes used: <br />

- Building Spam detectors
- Sentiment analysis (guessing whether a text contains positive or negative attitudes with respect to a topic, and detecting the mood of the speaker)
- Text-processing tasks such as spell correction

Naive Bayes is also popular because it does not need as much data to work. It can naturally handle multiple classes. With some slight variable modifications, it can also handle numeric variables. Scikit-learn provides three Naive Bayes classes in the sklearn.naive_bayes module:

- MultinomialNB: Uses the probabilities derived from a feature's presence. 
- BernoulliNB: Provides the multinomial functionality of Naive Bayes, but it penalizes the absence of a feature.
- GaussianNB: Defines a version of Naive Bayes that expects a normal distribution of all the features. If your variables have positive and negative values, this is the best choice.





## Performing the Hashing Trick

Scikit-learn provides you with most of the data structures and functionality you need to complete your data science project. There are even classes for the trickiest and most advanced problems.

For instance, when dealing with text, one of the most useful solutions provided by the Scikit-learn package is the hashing trick. We talked about how to work with text by using the bag of words model. All these powerful transformations can operate properly only if all your text is known and available in the memory of your computer.

A more serious data science challenge is to analyze online-generated text flows, such as from social networks or large online text repositories. The hashing trick can give you quite a few advantages here.

### Using hash functions

Hash functions can transform any input into an output whose characteristics are predictable. The most useful hash function characteristic is that, given a certain input, they always provide the same numeric output value. Consequently, they are called deterministic functions. For example, input a word like dog and the hashing function always returns the same number.

In a certain sense, hash functions are like a secret code, transforming everything into numbers. Unlike secret codes, however, you cannot convert the hashed code to its original value. In addition, in some rare cases, different words generate the same hashed result (also called a hash collision).

### Demonstrating the hashing trick

There are many hash functions, with MD5 (often used to check file integrity, because you can hash entire files) and SHA (used in cryptography) being the most popular. Python possesses a built-in hash function named hash. You can test the Python hash function:

In [None]:
hash('Python')

The value returned can differ between machines and, in Python 3, between sessions, because string hashing is randomized by default. When you need consistent output, rely on the Scikit-learn hash functions instead because their output is consistent across machines.
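If you only want a repeatable hash for experimenting, one option is to build it on the standard library's hashlib module. This is a sketch of an alternative, not the function Scikit-learn uses internally; the name `stable_hash` and the `n_buckets` parameter are assumptions for illustration.

```python
import hashlib

def stable_hash(word, n_buckets=1000):
    """Map a word to a repeatable index from 0 to n_buckets - 1."""
    # MD5 is used here only as a deterministic mixer, not for security
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

print(stable_hash("Python"))
print(stable_hash("Python") == stable_hash("Python"))
```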

A Scikit-learn hash function can also return an index in a specific positive range. You can obtain something similar using a built-in hash by employing standard division and its remainder:

In [None]:
abs(hash('Python')) % 1000

To see how this works, pretend that you want to transform a text string from the Internet into a numeric vector (a feature vector) so that you can use it for starting a machine learning project. A good strategy for managing this data science task is to employ one-hot-encoding, which produces a bag of words. Here are the steps for one-hot-encoding a string ("Python for data science") into a vector.

- Assign a number to each word, for instance, Python=0 for=1 data=2 science=3.
- Initialize the vector, counting the number of unique words that you assigned a code in Step 1.
- Use the codes assigned in Step 1 as indices for populating the vector with values, assigning a 1 where there is a coincidence with a word existing in the phrase.

The resulting feature vector is expressed as the sequence [1,1,1,1] and made of exactly four elements. Suppose that a new phrase suddenly arrives and you must vectorize it as well: "Python for machine learning". Now you have two new words, "machine" and "learning", to work with. The following steps help you create the new vectors:

- Assign these new codes: machine=4 learning=5
- Enlarge the previous vector to include the new words: [1,1,1,1,0,0]
- Compute the vector for the new string: [1,1,0,0,1,1]
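The steps above can be sketched in plain Python. The function name `one_hot` and the way the vocabulary is built are assumptions for illustration; codes are simply positions in a list that grows as new words appear.

```python
def one_hot(phrase, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in phrase."""
    words = set(phrase.split())
    return [1 if word in words else 0 for word in vocabulary]

phrases = ["Python for data science", "Python for machine learning"]

vocabulary = []                       # position in this list is the word's code
for phrase in phrases:
    for word in phrase.split():
        if word not in vocabulary:
            vocabulary.append(word)   # a new word gets the next free code

first_vector = one_hot(phrases[0], vocabulary)
second_vector = one_hot(phrases[1], vocabulary)
print(first_vector)
print(second_vector)
```

Note that every new word forces the vocabulary, and therefore every existing vector, to grow.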

One-hot-encoding is quite optimal because it creates efficient and ordered feature vectors. Unfortunately, one-hot-encoding fails and becomes difficult to handle when your project experiences a lot of variability with regard to its inputs. This is a common situation in data science projects working with text or other symbolic features, where the flow from the Internet or other online environments can suddenly create new data or add to your initial data. Using hash functions is a smarter way to handle unpredictability in your inputs.

In Python, you can define a simple hashing trick by creating a function and checking the results using the two test strings:


In [None]:
string_1 = 'Python for data science'
string_2 = 'Python for machine learning'

def hashing_trick(input_string, vector_size=20):
    feature_vector = [0] * vector_size
    for word in input_string.split(' '):
        index = abs(hash(word)) % vector_size
        feature_vector[index] = 1
    return feature_vector

print(hashing_trick(input_string=string_1, vector_size=20))
print(hashing_trick(input_string=string_2, vector_size=20))

When viewing feature vectors, you should notice that:

- You don't know where each word is located. When it's important to be able to reverse the process of assigning words to indices, you must store the relationship between words and their hashed values separately.
- For small values of the vector_size parameter, many words overlap in the same positions in the list representing the feature vector. (Use a greater vector_size to minimize the overlap.)

The feature vectors in this example are made mostly of zero entries, representing a waste of memory when compared to the more memory-efficient one-hot-encoding. One of the ways in which you can solve this problem is to rely on sparse matrices.

### Working with deterministic selection

Sparse matrices are the answer when dealing with data that has few values, that is, when most of the matrix values are zeros. Sparse matrices store just the coordinates of the cells and their values. Here is an example:

In [None]:
from scipy.sparse import csc_matrix
print(csc_matrix([1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0]))
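The idea of storing only coordinates and values can be illustrated without any library. This sketch uses a plain dictionary as a stand-in; it is not how SciPy lays out its sparse formats internally, and the variable names are assumptions for illustration.

```python
dense = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

# keep only the positions that hold a nonzero value
sparse = {index: value for index, value in enumerate(dense) if value != 0}
print(sparse)

# rebuilding the dense vector needs the coordinates plus the original length
rebuilt = [sparse.get(index, 0) for index in range(len(dense))]
print(rebuilt == dense)
```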

As a data scientist, you don't have to worry about programming your own version of the hashing trick unless you would like some customization. Scikit-learn offers HashingVectorizer, a class that rapidly transforms any collection of text into a sparse data matrix using the hashing trick. For comparison, the following example first builds a one-hot encoding of the two test strings with CountVectorizer:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
one_hot_encoder = CountVectorizer()
one_hot_encoded = one_hot_encoder.fit_transform(['Python for data science','Python for machine learning'])
# two rows, one per string in the list, and six columns, one per distinct word
# the result is a compressed sparse row matrix with eight stored values in total
one_hot_encoded

In [None]:
print(one_hot_encoder.vocabulary_)

When new text arrives, CountVectorizer can encode only the words it saw during fitting, so words outside its vocabulary are silently dropped:

In [None]:
one_hot_encoder.transform(['New text has arrived'])

Using HashingVectorizer, there is always a place for new words in the data matrix. At worst, a word settles in an already occupied position, causing a word collision.

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
sklearn_hashing_trick = HashingVectorizer(n_features=20, binary=True, norm=None)
hashed_text = sklearn_hashing_trick.transform(['Python for data science','Python for machine learning'])
hashed_text

In [None]:
sklearn_hashing_trick.transform(['New text has arrived'])

HashingVectorizer is the right class to use when your data cannot fit into memory and its features are not fixed. In other cases, consider using the more intuitive CountVectorizer.

## Predicting text classifications using Naive Bayes

In [None]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
# one of the posts in the training data
print(newsgroups_train.data[0])
print("\n")
# all the target names for this training data
print(newsgroups_train.target_names)
print("\n")
# the target (category of text) of each post, as an index into target_names
print(newsgroups_train.target)

In [None]:
print('number of posts in training: %i' % len(newsgroups_train.data))
# collect every distinct space-separated token in the training posts
D = {word: True for post in newsgroups_train.data for word in post.split(' ')}
print('number of distinct words in training: %i' % len(D))
print('number of posts in test: %i' % len(newsgroups_test.data))

In [None]:
# We set alpha values, which are useful for avoiding a zero probability for rare features (a zero probability would 
# exclude these features from the analysis)
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
Bernoulli = BernoulliNB(alpha=0.01)
Multinomial = MultinomialNB(alpha=0.01)
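To see why a small alpha avoids zero probabilities, here is a minimal sketch of additive smoothing. The function name `smoothed_probability`, the counts, and the `n_values` parameter are assumptions for illustration; the exact formulas Scikit-learn applies differ in detail between its Naive Bayes classes.

```python
def smoothed_probability(count, total, alpha, n_values):
    """Additive smoothing: (count + alpha) / (total + alpha * n_values)."""
    return (count + alpha) / (total + alpha * n_values)

# made-up counts: a word never seen in 300 posts of some class,
# where the feature can take 2 values (present or absent)
unsmoothed = smoothed_probability(0, 300, alpha=0.0, n_values=2)
smoothed = smoothed_probability(0, 300, alpha=0.01, n_values=2)
print(unsmoothed, smoothed)
```

Without smoothing, a single unseen word would multiply the whole class probability by zero; with a small alpha, the estimate stays tiny but positive.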

We can use two different hashing tricks, one counting the words (for the multinomial approach) and one recording whether a word appeared in a binary variable (the binomial approach, in which all nonzero counts are set to 1). You can also remove stop words, that is, common words found in the English language such as "a," "the," "in," and so on.

In [None]:
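from sklearn.feature_extraction.text import HashingVectorizer
# alternate_sign=False keeps the hashed values non-negative, as Naive Bayes requires;
# it replaces the non_negative=True option, which was removed in scikit-learn 0.21
multinomial_hashing_trick = HashingVectorizer(stop_words='english', binary=False, norm=None, alternate_sign=False)
binary_hashing_trick = HashingVectorizer(stop_words='english', binary=True, norm=None, alternate_sign=False)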

In [None]:
Multinomial.fit(multinomial_hashing_trick.transform(newsgroups_train.data),newsgroups_train.target)
Bernoulli.fit(binary_hashing_trick.transform(newsgroups_train.data),newsgroups_train.target)
from sklearn.metrics import accuracy_score
for m,h in [(Bernoulli,binary_hashing_trick), (Multinomial,multinomial_hashing_trick)]:
    print('Accuracy for %s: %.3f' % (m, accuracy_score(y_true=newsgroups_test.target, y_pred=m.predict(h.transform(newsgroups_test.data)))))

# Exploring Lazy Learning with K-nearest Neighbors

## Predicting after observing neighbors


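K-nearest neighbors is called a lazy learner because it does the least amount of work at training time: it simply keeps all the records and their labels as training data. When a new record arrives, it finds the k nearest training records (by Euclidean distance, for example) and infers the label from the labels of those neighbors.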
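The whole procedure fits in a few lines of plain Python. This is a minimal sketch with made-up two-dimensional points; the function name `knn_predict` and the majority-vote rule are assumptions for illustration, and it makes no attempt at the efficiency of Scikit-learn's implementation.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Straight-line distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_predict(train_points, train_labels, new_point, k=3):
    """Label new_point by majority vote among its k nearest neighbors."""
    distances = [(euclidean(point, new_point), label)
                 for point, label in zip(train_points, train_labels)]
    distances.sort(key=lambda pair: pair[0])           # nearest first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# made-up records: "training" is nothing more than storing them
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2.0, 2.0)))
print(knn_predict(points, labels, (8.0, 9.0)))
```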

In [None]:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
digits = load_digits()
pca = PCA(n_components=25)
pca.fit(digits.data[:1700,:])
X, y = pca.transform(digits.data[:1700,:]), digits.target[:1700]
tX, ty = pca.transform(digits.data[1700:,:]), digits.target[1700:]

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2.
kNN = KNeighborsClassifier(n_neighbors=5, p=2)
kNN.fit(X,y)
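The role of the p parameter is easier to see with a small calculation. The sketch below implements the Minkowski distance directly; the function name `minkowski` and the sample points are assumptions for illustration.

```python
def minkowski(u, v, p):
    """Distance between u and v; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

manhattan = minkowski((0, 0), (3, 4), p=1)   # |3| + |4|
euclid = minkowski((0, 0), (3, 4), p=2)      # sqrt(3**2 + 4**2)
print(manhattan, euclid)
```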

In [None]:
print('Accuracy: %.3f' % kNN.score(tX,ty))
print('Prediction: %s actual: %s' % (kNN.predict(tX[:10,:]),ty[:10]))

## Choosing your k parameter wisely

In [None]:
for k in [1,5,10,20,50,100,200]:
    kNN = KNeighborsClassifier(n_neighbors=k).fit(X,y)
    print('for k=%3i accuracy is %.3f' % (k, kNN.score(tX,ty)))